(54) RECONFIGURABLE SIMD COPROCESSOR ARCHITECTURE FOR SUM OF ABSOLUTE
DIFFERENCES AND SYMMETRIC FILTERING

(57) Abstract:

PROBLEM TO BE SOLVED: To provide an image processing peripheral that
efficiently performs the various computations used in image processing.

SOLUTION: The proposed architecture is integrated into a digital signal
processor (DSP) as a coprocessor 140 and assists in the computation of the
sum of absolute differences, symmetrical row/column FIR filtering with a
downsampling (or upsampling) option, row/column DCT/IDCT, and generic
algebraic functions. The architecture consists of 8 multiply-accumulate
hardware units, connected in parallel and routed and multiplexed together,
and relies on a DMA controller 120 to retrieve data from and write data back
to DSP memory without intervention from the DSP core 110. The DSP can set up
the DMA transfers and the IPP/DMA synchronization in advance and then proceed
with its own processing task. Alternatively, the DSP can perform the data
transfers and synchronization itself by synchronizing with the IPP
architecture.
Japanese Laid-Open Patent Publication No. 2001-236496




Receive_data_synchronization (signal, true/false), or wait_until_signal

Send_data_synchronization (signal, true/false), or assert_signal

Synchronization_completion (signal, true/false), or assert_signal

Call_subroutine(subroutine_addr)

Return()

Reset()

Write_parameter(parameter, value)

dptr = dptr_init;   /* initialize pointers */
cptr = cptr_init;
optr = optr_init;

for (i1 = 0; i1 <= lp1end; i1++) {
  for (i2 = 0; i2 <= lp2end; i2++) {
    for (i3 = 0; i3 <= lp3end; i3++) {
      for (i4 = 0; i4 <= lp4end; i4++) {
        x[0..7] = dptr[0..7];   /* or dptr[0], dptr[0,1], dptr[0,1,2,3] ... */
        y[0..7] = cptr[0..7];   /* or cptr[0], cptr[0,1], ... */

        if (initialize_acc)
          acc[i4 % accmode][0..7] = rnd_add[0..7];

        /* operate */
        acc[i4 % accmode][0..7] += x[0..7] op y[0..7];

        /* write back */
        if (write_back)
          optr[0..7] = saturate_round(acc[i4 % accmode][0..7]);

        dptr += ...;
        cptr += ...;
        optr += ...;
      }
    }
  }
}
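The loop nest above can be modeled in software to check one's reading of it. The following Python sketch is an illustration only, under assumptions that are not stated in the source: the name `ipp_loop`, a single loop level, a fixed pointer stride of 8, accumulator selection by `i % accmode`, 16-bit saturation, and a write-back that happens once after the last iteration are all hypothetical choices.

```python
LANES = 8  # the source describes 8 parallel multiply-accumulate units


def saturate(value, bits=16):
    """Clamp value to a signed range; the 16-bit width is an assumption."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def ipp_loop(data, coef, n_iter, accmode=1, op=lambda a, b: a * b,
             rnd_add=0, shift=0):
    """Model acc[i % accmode][lane] += x[lane] op y[lane] over n_iter steps.

    Pointer strides are fixed at LANES (the source elides them). Every
    accumulator bank starts at rnd_add and is written back, shifted and
    saturated, after the final iteration.
    """
    acc = [[rnd_add] * LANES for _ in range(accmode)]
    dptr = cptr = 0
    for i in range(n_iter):
        x = data[dptr:dptr + LANES]      # 8 data operands
        y = coef[cptr:cptr + LANES]      # 8 coefficient operands
        bank = acc[i % accmode]
        for lane in range(LANES):
            bank[lane] += op(x[lane], y[lane])
        dptr += LANES
        cptr += LANES
    return [[saturate(v >> shift) for v in bank] for bank in acc]
```

Passing `op=lambda a, b: abs(a - b)` turns the same loop into a per-lane absolute-difference accumulation, which is one way to read the generic `op` in the pseudocode.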
Reconfigurable SIMD Coprocessor Architecture for Sum of Absolute Differences and
Symmetric Filtering (Scalable MAC Engine for Image Processing)

Field of the Invention 

This invention relates in general to signal processing and more specifically to Single
Instruction Multiple Data (SIMD) coprocessor architectures providing for faster image and
video signal processing, including one- and two-dimensional filtering, transforms, and other
common tasks.

Background of the Invention 
A problem which has arisen in image processing technology is that two-dimensional (2-D)
filtering has a different addressing pattern than one-dimensional (1-D) filtering. Previous DSP
processors and coprocessors, designed for 1-D, may have to be modified to process 2-D video
signals. The desired end goal is to enable a digital signal processor (DSP) or coprocessor to
perform image and video processing expediently. In image processing, the most useful
operation is 1-D and 2-D filtering, which requires addressing the 2-D data and 1-D or 2-D
convolution coefficients. When the convolution coefficients are symmetrical, an architecture
that makes use of the symmetry can reduce computation time roughly in half. The primary
bottleneck identified for most video encoding algorithms is that of motion estimation. The
problem of motion estimation may be addressed by first convolving an image with a kernel to
reduce it into lower resolution images. These images are then reconvolved with the same
kernel to produce even lower resolution images. The sum of absolute differences may then be
computed within a search window at each level to determine the best matching subimage for a
subimage in the previous frame. Once the best match is found at lower resolution, the search
is repeated within the corresponding neighborhood at higher resolutions. In view of the above,
a need has arisen for an architecture capable of performing 1-D/2-D filtering, preferably
symmetrical filtering as well, and the sum of absolute differences with equal efficiency.
Previously, specialized hardware or general purpose DSPs were used to perform the operations
of summing of absolute differences and symmetric filtering in SIMD coprocessor architectures.
Intel's MMX technology is similar in concept although much more general purpose. Copending
applications filed on February 4, 1998, titled "Multi-Multiply-Accumulate (Multi-MAC)
Coprocessor Architecture", Serial No. 60/073,668 (TI-26868), and "DSP with Efficiently
Connected Hardware Coprocessor", Serial No. 60/073,641 (TI-26867), embody host
processor/coprocessor interface and efficient Finite Impulse Response/Fast Fourier Transform
(FIR/FFT) filtering implementations that this invention extends to several other functions.
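The sum-of-absolute-differences search described in the Background can be sketched as follows. This is a minimal illustration of the matching step at a single resolution level, not the coprocessor's implementation; the helper names `sad`, `crop`, and `best_match`, the square block shape, and the exhaustive scan of the search window are assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(image, top, left, size):
    """Return the size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def best_match(block, frame, center, radius):
    """Scan a search window around center and return (cost, (top, left)).

    The window is clipped to the frame; ties keep the first position found.
    """
    size = len(block)
    rows, cols = len(frame), len(frame[0])
    cy, cx = center
    best = None
    for top in range(max(0, cy - radius), min(rows - size, cy + radius) + 1):
        for left in range(max(0, cx - radius), min(cols - size, cx + radius) + 1):
            cost = sad(block, crop(frame, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, (top, left))
    return best
```

In the coarse-to-fine scheme the text describes, a search like this would first run on the lowest resolution image, and the winning position would then set `center` for a smaller search at the next higher resolution.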

Summary of the Invention 

The proposed architecture is integrated onto a Digital Signal Processor (DSP) as a
coprocessor to assist in the computation of sum of absolute differences, symmetrical
row/column Finite Impulse Response (FIR) filtering with a downsampling (or upsampling)
option, row/column Discrete Cosine Transform (DCT)/Inverse Discrete Cosine Transform
(IDCT), and generic algebraic functions. The architecture is called IPP, which stands for
image processing peripheral, and consists of 8 multiply-accumulate hardware units connected
in parallel and routed and multiplexed together. The architecture can be dependent upon a
Direct Memory Access (DMA) controller to retrieve and write back data from/to DSP memory
without intervention from the DSP core. The DSP can set up the DMA transfer and IPP/DMA
synchronization in advance, then go on to its own processing task. Alternatively, the DSP can
perform the data transfers and synchronization itself by synchronizing with the IPP architecture
on these transfers. This architecture implements 2-D filtering, symmetrical filtering, short
filters, sum of absolute differences, and mosaic decoding more efficiently than the previously
disclosed Multi-MAC coprocessor architecture (TI-26868, Serial No. 60/073,641, titled
"Reconfigurable Multiple Multiply-Accumulate Hardware Co-Processor Unit", filed on
January 4, 1998, and incorporated herein by reference). This coprocessor will greatly
accelerate the DSP's capacity to perform specifically common 2-D signal processing tasks.
The architecture is also scalable, providing an integer speed-up in performance for each
additional Single Instruction Multiple Data (SIMD) block added to the architecture (provided
the DMA can handle data transfers among the DSP and the coprocessor at a rapid enough
rate). This technology could greatly accelerate video encoding. This architecture may be
integrated onto existing DSPs such as the Texas Instruments TMS320C54x and TMS320C6x.
Each of these processors already contains a DMA controller for data transfers.
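To make the benefit of symmetrical filtering concrete, the sketch below compares a direct FIR filter with one that pre-adds mirrored samples so that each pair of equal coefficients costs a single multiply. It is an illustration under assumptions: the function names, the "valid" output range, and the `step` parameter used to model the downsampling option are hypothetical and are not taken from the source.

```python
def fir_direct(x, h, step=1):
    """Valid-range FIR: one output per window, one multiply per tap."""
    taps = len(h)
    return [sum(h[k] * x[n + k] for k in range(taps))
            for n in range(0, len(x) - taps + 1, step)]


def fir_symmetric(x, h, step=1):
    """Same result for symmetric h, using a pre-add before each multiply."""
    taps = len(h)
    if any(h[k] != h[taps - 1 - k] for k in range(taps)):
        raise ValueError("coefficients must be symmetric")
    half = taps // 2
    out = []
    for n in range(0, len(x) - taps + 1, step):
        # Add the two samples that share a coefficient, then multiply once.
        acc = sum(h[k] * (x[n + k] + x[n + taps - 1 - k]) for k in range(half))
        if taps % 2:                      # odd length: center tap stands alone
            acc += h[half] * x[n + half]
        out.append(acc)
    return out
```

For a 3-tap symmetric kernel the pre-add form needs 2 multiplies per output instead of 3, and for a 4-tap kernel 2 instead of 4, which is where the "roughly in half" figure comes from. Setting `step=2` keeps every second output, a simple model of filtering with downsampling.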

Brief Description of the Drawings 
The accompanying drawings, which are incorporated in and constitute a part of the
specification, schematically illustrate a preferred embodiment of the invention and, together
with the general description given above and the detailed description of the preferred
embodiment given below, serve to explain the principles of the invention. These and other
aspects of this invention are illustrated in the drawings, in which:

Figure 1 illustrates the combination of a digital signal processor core and a
reconfigurable hardware coprocessor in accordance with this invention, with the coprocessor
closely coupled to the internal bus of the DSP;

Figure 2 illustrates the memory map logical coupling between the digital signal
processor core and the reconfigurable hardware co-processor of this invention;

Figure 3 illustrates a manner of using the reconfigurable IPP hardware co-processor of
this invention;

Figure 4 illustrates an alternative embodiment of the combination of Figure 1 including
two co-processors with a private bus in between;

Figure 5 illustrates an alternate connection between the DSP and the IPP coprocessor,
where the coprocessor and its memory blocks form a subsystem which is loosely connected to
the DSP on a system bus;

Figure 6 illustrates the IPP overall block diagram architecture according to a preferred
embodiment of the invention;

Figure 7 illustrates the input formatter of the reconfigurable IPP hardware co-processor
illustrated in Figure 6;

Figure 8 illustrates a schematic diagram of the IPP Datapath Architecture A, with 8
independent MACs;

Figure 9 illustrates the output formatter of the reconfigurable IPP hardware
co-processor illustrated in Figure 6;

Figure 10 illustrates a diagram of the IPP datapath architecture B of one alternative
adder configuration of the adder portion of the IPP, the single 8-tree adder, according to a
preferred embodiment;

Figure 11 illustrates a diagram of the IPP datapath architecture C of another alternative
adder configuration of the adder portion of the IPP, dual 4-trees with butterfly, according to a
preferred embodiment;

Figure 12 illustrates a diagram of the IPP datapath architecture D of another alternative
adder configuration of the adder portion of the IPP, quad 2-trees, according to a preferred
embodiment;

Figure 13 illustrates a diagram of the IPP reconfigurable datapath architecture that
includes routing and multiplexing necessary to support the A/B/C/D configurations shown in
Figures 8, 10, 11, and 12;

Figure 14 illustrates a diagram of a simplified version of the IPP reconfigurable
datapath architecture, which supports the previous A and D versions without Pre-Add (Figures
8 and 12);

Figure 15 illustrates a diagram of another simplified version of the IPP datapath
architecture which only has 4 MACs and supports only the previous A version without
Pre-Add;

Figure 16 illustrates the reformatting of the input coefficients to the Datapath block
necessary to perform a 3-tap FIR ROW filtering according to a preferred embodiment of the
invention;

Figure 17 illustrates the reformatting of the input coefficients to the Datapath block
necessary to perform a 3-tap symmetric FIR ROW filtering according to a preferred
embodiment of the invention;

Figure 18 illustrates from where, in the memory, the input coefficients are read and
whereto the output coefficients are written, necessary to perform a 3-tap FIR column filtering
according to a preferred embodiment of the invention;

Figure 19 illustrates a schematic of the data path block with a tree adder when the IPP
is performing a sum of absolute differences operation according to a preferred embodiment of
the invention;

Figure 20 illustrates the lesser density of the Red and Blue colors versus the Green
color involved in a demosaic operation;

Figure 21 illustrates the reformatting of the data necessary to perform a ROW pass
portion of the demosaic operation according to a preferred embodiment of the invention;

Figure 22 illustrates the reformatting of the data necessary to perform a COLUMN pass
portion of the demosaic operation according to a preferred embodiment of the invention;

Figure 23 illustrates the reformatting of the input data necessary to perform row-wise
wavelets transform, similar to symmetric ROW filtering, according to a preferred embodiment
of the invention;

Figure 24 illustrates the reformatting of the input data necessary to perform column-wise
wavelets transform, similar to column filtering, according to a preferred embodiment of
the invention;

Figure 25 illustrates the post-multiplier adders of a split adder tree with butterfly
configuration (C, Figure 11) necessary to implement the cross additions and subtractions of the
row-wise Inverse Discrete Cosine Transform (IDCT);

Figure 26 illustrates the pre-multiply adders of a split adder tree with butterfly
configuration (C, Figure 12) with the butterfly disabled necessary to implement the cross
additions and subtractions of the row-wise Discrete Cosine Transform (DCT);

Figure 27 illustrates the column-wise IDCT and DCT implemented in SIMD mode of
operation, similar to the column FIR filtering;

Figure 28 illustrates two of the 8 MAC units of Figure 14 in a more detailed drawing of
components.



Detailed Description of the Preferred Embodiments

Figure 1 illustrates circuit 100 including digital signal processor core 110 and a
reconfigurable IPP hardware co-processor 140. Figure 1 is the same Figure 1 as in co-pending
application Serial No. 60/073,641, titled "Reconfigurable Multiple Multiply-Accumulate
Hardware Co-processor Unit", assigned to the same assignee, the co-processor of which a
preferred embodiment of this invention is made. In accordance with a preferred embodiment
of this invention, these parts are formed in a single integrated circuit (IC). Digital signal
processor core 110 may be of conventional design. The IPP is a memory-mapped peripheral.
Transferring data between the IPP's and the DSP's working memory can be carried out via the
Direct Memory Access (DMA) controller 120 without intervention from the digital signal
processor core 110. Alternatively, the DSP core 110 can handle data transfer itself via direct
load/store to the IPP's working memory 141, 145 and 147. A combination of the two transfer
mechanisms is also possible, as the DMA can handle large data/coefficient transfers more
efficiently, and the DSP can directly write out short commands to IPP command memory 141
more efficiently.
The reconfigurable IPP hardware co-processor 140 has a wide range of functionality
and supports symmetrical/asymmetrical row/column filtering, 2-D filtering, sum of absolute
differences, row/column DCT/IDCT and generic linear algebraic functions. Symmetrical
row/column filtering is frequently used in up/down sampling to resize images to fit display
devices. Two-dimensional filtering is often used for demosaic and for image enhancement in
digital cameras. Sum of absolute differences is implemented in MPEG video encoding and in
H.263 and H.323, encoding standards for telephone line video conferencing. Row/column
DCT/IDCT is implemented in JPEG image encoding/decoding and MPEG video
encoding/decoding. Generic linear algebraic functions, including array addition/subtraction
and scaling, are frequently used in imaging and video applications to supplement the filtering
and transform operations. For example, digital cameras require scaling of pixels to implement
gain control and white balancing.
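As an illustration of the pixel scaling mentioned for gain control and white balancing, the sketch below multiplies pixels by a fixed-point gain, rounds, and saturates. The Q8 gain format (gain expressed as an integer over 256), the 8-bit pixel range, and the function name `scale_pixels` are assumptions chosen for the example and do not come from the source.

```python
def scale_pixels(pixels, gain_q8, max_value=255):
    """Scale each pixel by gain_q8 / 256 with rounding and saturation.

    Adding 128 before the shift rounds to the nearest integer; the result
    is clamped to [0, max_value] so bright pixels do not wrap around.
    """
    return [min(max_value, max(0, (p * gain_q8 + 128) >> 8)) for p in pixels]
```

White balancing could then be modeled as one call per color channel with a different gain for each, for example `scale_pixels(red, 384)` for a gain of 1.5 on a hypothetical red channel.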

In the preferred embodiment, reconfigurable IPP hardware co-processor 140 can be
programmed to coordinate with direct memory access circuit 120 for autonomous data transfers
independent of digital signal processor core 110. External memory interface 130 serves to
interface the internal data bus 101 and address bus 103 to their external counterparts, external
data bus 131 and external address bus 133, respectively. External memory interface 130 is
conventional in construction. Integrated circuit 100 may optionally include additional
conventional features and circuits. Note particularly that the addition of cache memory to
integrated circuit 100 could substantially improve performance. The parts illustrated in Figure
1 are not intended to exclude the provision of other conventional parts. Those conventional
parts illustrated in Figure 1 are merely the parts most affected by the addition of reconfigurable
hardware co-processor 140.



TI-27177






Reconfigurable IPP hardware co-processor 140 is coupled to other parts of integrated
circuit 100 via a data bus 101 and address bus 103. Reconfigurable IPP hardware co-processor
140 includes command memory 141, co-processor logic core 143, data memory 145, and
coefficient memory 147. Command memory 141 serves as the conduit by which digital signal
processor core 110 controls the operations of reconfigurable hardware co-processor 140.
Co-processor logic core 143 is responsive to commands stored in command memory 141, which
form a command queue, to perform co-processing functions. These co-processing functions
involve exchange of data between co-processor logic core 143 and data memory 145 and
coefficient memory 147. Data memory 145 stores the input data processed by reconfigurable
hardware co-processor 140 and further stores the results of the operations of reconfigurable
hardware co-processor 140. Coefficient memory 147 stores the unchanging or relatively
unchanging process parameters, called coefficients, used by co-processor logic core 143.
Though data memory 145 and coefficient memory 147 have been shown as separate parts, it
would be easy to employ these merely as different portions of a single, unified memory. As
will be shown below, for the multiple multiply-accumulate co-processor described, it is best if
such a single unified memory has two read ports for reading data and coefficients and one
write port for writing output data. As multiple-port memory takes up more silicon area than
single-port memory of the same capacity, the memory system can be partitioned into blocks to
achieve multiple access points. With such a memory configuration, it is desirable to equip the
IPP with a memory arbitration and stalling mechanism to deal with memory access conflicts.
It is believed best that the memory accessible by reconfigurable IPP hardware co-processor 140
be located on the same integrated circuit in physical proximity to co-processor logic core 143.
This physical closeness is needed to accommodate the wide memory buses required by the
desired data throughput of co-processor logic core 143.

Figure 2 illustrates the memory-mapped interface between digital signal processor core
110 and reconfigurable IPP hardware co-processor 140. Digital signal processor core 110
controls reconfigurable IPP hardware co-processor 140 via command memory 141. In the
preferred embodiment, command memory 141 is a first-in-first-out (FIFO) memory with a
command queue. The write port of command memory 141 is memory mapped into a single
memory location within the address space of digital signal processor core 110. Thus digital
signal processor core 110 controls reconfigurable IPP hardware co-processor 140 by writing
commands to the address serving as the input to command memory 141. Command memory
141 preferably includes two circularly oriented pointers. The write pointer 151 points to the
location within command memory 141 wherein the next received command is to be stored.
Each time there is a write to the predetermined address of command memory 141, write
pointer 151 selects the physical location receiving the data. Following such a data write, write
pointer 151 is updated to point to the next physical location within command memory 141.
Write pointer 151 is circularly oriented in that it wraps around from the last physical location
to the first physical location. Reconfigurable IPP hardware co-processor 140 reads commands
from command memory 141 in the same order as they are received (FIFO) using read pointer
153. Read pointer 153 points to the physical location within command memory 141 storing the
next command to be read. Read pointer 153 is updated to reference the next physical location
within command memory 141 following each such read. Note that read pointer 153 is also
circularly oriented and wraps around from the last physical location to the first physical
location. Command memory 141 includes a feature preventing write pointer 151 from passing
read pointer 153. This may take place, for example, by refusing to write and sending a
memory fault signal back to digital signal processor core 110 when write pointer 151 and read
pointer 153 reference the same physical location. Thus the FIFO buffer of command memory
141 can be full and not accept additional commands.
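The circular write and read pointers of the command FIFO can be modeled with a small ring buffer. This is a sketch under assumptions: the class name `CommandFifo`, the explicit `count` field used to tell a full queue from an empty one, and the use of a `False` return value in place of the memory fault signal are illustrative choices, not details given in the source.

```python
class CommandFifo:
    """Ring-buffer model of a command queue with wrapping pointers."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.write_ptr = 0   # location where the next command is stored
        self.read_ptr = 0    # location of the next command to be read
        self.count = 0       # assumed bookkeeping: distinguishes full from empty

    def write(self, command):
        """Store a command; return False (stand-in for a fault) when full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.write_ptr] = command
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)  # wrap around
        self.count += 1
        return True

    def read(self):
        """Return the oldest command, or None when the queue is empty."""
        if self.count == 0:
            return None
        command = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)    # wrap around
        self.count -= 1
        return command
```

With two pointers alone, equal pointers are ambiguous between "empty" and "full"; the occupancy count is one common way to resolve that, and hardware designs may instead use a wrap bit or keep one slot unused.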

Many digital signal processing tasks will use plural instances of similar functions. For
example, the process may include several filter functions. Reconfigurable IPP hardware co-
processor 140 preferably has sufficient processing capability to perform all of these filter
functions in real time. The macro store area 149 can be used to store common functions in the
form of subroutines so that invoking these functions takes just a "call subroutine" command in
the command queue 141. This reduces traffic on the command memory and potentially the
memory requirement on the command memory as a whole. Figure 2 illustrates 3 subroutines
A, B, and C residing in the macro store area 149, with each subroutine ending with a "return"
command.
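
As a rough illustration of how a "call subroutine" command might expand into the commands held in the macro store, consider the following sketch. The tuple encoding of commands, the function name run, and the restriction to a single level of calls are assumptions made for illustration only.

```python
def run(queue, macro_store):
    """Expand call_subroutine commands and return the commands executed in order.

    Assumption: subroutines do not call other subroutines (single call level).
    """
    executed = []
    for command in queue:
        if command[0] == "call_subroutine":
            pc = command[1]  # start address of the subroutine in the macro store
            while macro_store[pc][0] != "return":
                executed.append(macro_store[pc])
                pc += 1
        else:
            executed.append(command)
    return executed


# Hypothetical macro store: subroutine A at address 0, subroutine B at address 2.
macro_store = [
    ("row_filter",), ("return",),
    ("column_filter",), ("vector_add",), ("return",),
]
queue = [("call_subroutine", 0), ("sum_abs_diff",), ("call_subroutine", 2)]
trace = run(queue, macro_store)
```

Here three queue entries stand for four computational commands, which is the traffic reduction the macro store is meant to provide.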

An alternative to the command FIFO/macro store combination is static command memory
contents that the DSP sets up initially. The command memory can hold multiple command
sequences, each ending with a "sleep" command. The DSP instructs the IPP to execute a particular
command sequence by writing the starting address of the sequence to an IPP control register.
The IPP executes the specified commands until encountering the sleep command, when it goes into
standby mode waiting for further instruction from the DSP. Data memory 145 and coefficient
memory 147 can both be mapped within the data address space of digital signal processor core
110. As illustrated in Figure 2, data bus 101 is bidirectionally coupled to memory 149. In
accordance with the alternative embodiment noted above, both data memory 145 and
coefficient memory 147 are formed as a part of memory 149. Memory 149 is also accessible
by co-processor logic core 143 (not illustrated in Figure 2). Figure 2 illustrates three
circumscribed areas of memory within memory 149. As will be further described below,
reconfigurable hardware co-processor 140 performs several functions employing differing
memory areas.

Integrated circuit 100 operates as follows. Either digital signal processor core 110 or
DMA controller 120 controls the data and coefficients used by reconfigurable IPP hardware co-
processor 140 by loading the data into data memory 145 and the coefficients into coefficient
memory 147 or, alternatively, both the data and the coefficients into unified memory 149.
Digital signal processor core 110 may be programmed to perform this data transfer directly, or
alternatively, digital signal processor core 110 may be programmed to control DMA controller
120 to perform this data transfer. Particularly for audio or video processing applications, the
data stream is received at a predictable rate and from a predictable device. Thus it would
typically be efficient for digital signal processor core 110 to control DMA controller 120 to make
transfers from external memory to memory accessible by reconfigurable hardware co-
processor 140.

Following the transfer of data to be processed, digital signal processor core 110 signals
reconfigurable IPP hardware co-processor 140 with the command for the desired signal
processing algorithm. As previously stated, commands are sent to reconfigurable IPP
hardware co-processor 140 by a memory write to a predetermined address within Command
Queue 141. Received commands are stored in Command Queue 141 on a first-in-first-out
basis. Each computational command of the reconfigurable IPP co-processor preferably includes a
manner to specify the particular function to be performed. In the preferred embodiment, the
reconfigurable hardware co-processor is constructed to be reconfigurable. Reconfigurable IPP
hardware co-processor has a set of functional units, such as multipliers and adders, that can be
connected together in differing ways to perform different but related functions. The set of
related functions selected for each reconfigurable hardware co-processor will be based upon a
similarity of the mathematics of the functions. This similarity in mathematics enables similar
hardware to be reconfigured for the plural functions. The command may indicate the
particular computation via an opcode in the manner of data processor instructions.

Each computational command includes a manner of specifying the location of the input
data to be used by the computation. There are many suitable methods of designating data
space. For example, the command may specify a starting address and number of data words or
samples within the block. The data size may be specified as a parameter or it may be specified
by the op code defining the computation type. As a further example, the command may
specify the data size, the starting address and the ending address of the input data. Note that
known indirect methods of specifying where the input data is stored may be used. The
command may include a pointer to a register or a memory location storing any number of these
parameters such as start address, data size, number of samples within the data block and
end address.

Each computational command must further indicate the memory address range storing
the output data of the particular command. This indication may be made by any of the
methods listed previously with regard to the locations storing the input data. In many cases the
computational function will be a simple filter function and the amount of output data following
processing will be about equivalent to the amount of input data. In other cases, the amount of
output data may be more or less than the amount of input data. In any event, the amount of
resultant data is known from the amount of input data and the type of computational function
requested. Thus merely specifying the starting address provides sufficient information to
indicate where all the resultant data is to be stored. It is feasible to store output data in a
destructive manner, overwriting input data during processing. Alternatively, the output data
may be written to a different portion of memory and the input data preserved at least
temporarily. The selection between these alternatives may depend upon whether the input data
will be reused.

Figure 3 illustrates one useful technique involving alternatively employing two memory
areas. One memory area 145 stores the input data needed for co-processor function. The
relatively constant coefficients are stored in coefficient memory 147. The input data is recalled
for use by co-processor logic core 143 (1 read) from a first memory area 144 of the data
memory 145. The output data is written into the second memory area 146 of the data
memory (1 write). Following use of the data memory area, direct memory access circuit 120
writes the data into the first memory area 144 for the next block, overwriting the data
previously used (2 write). At the same time, direct memory access circuit 120 reads data from
second memory area 146 ahead of it being overwritten by reconfigurable hardware co-
processor 140 (2 read). These two memory areas for input data and for resultant data could
be configured as circular buffers. In a product that requires plural related functions, separate
memory areas defined as circular buffers can be employed. One memory area configured as a
circular buffer will be allocated to each separate function.

The format of computational commands preferably closely resembles the format of a
subroutine call instruction in a high level language. That is, the command includes a command
name similar in function to the subroutine name specifying the particular computational
function to be performed. Each command also includes a set of parameters specifying
available options within the command type. For example, consider the following list of
computational commands and their various parameters:

Row_filter(us, ds, length, block, data_addr, coef_addr, outp_addr)
Column_filter(us, ds, length, block, data_addr, coef_addr, outp_addr)
Row_filter_sym(us, ds, length, block, data_addr, coef_addr, outp_addr)
Sum_abs_diff(length, data_addr1, data_addr2, outp_addr)
Row_DCT(data_addr, outp_addr), Row_IDCT, Column_DCT, Column_IDCT
Vector_add(length, data_addr1, data_addr2, outp_addr)
These parameters may take the form of direct quantities or variables, which are pointers to
registers or memory locations storing the desired quantities. The number and type of these
parameters depends upon the command type. This subroutine call format is important in
reusing programs written for digital signal processor core 110. Upon use, the programmer or
compiler provides a stub subroutine to activate reconfigurable IPP hardware co-processor 140.
This stub subroutine merely receives the subroutine parameters and forms the corresponding
co-processor command using these parameters. The stub subroutine then writes this command
to the predetermined memory address reserved for command transfers to reconfigurable
hardware co-processor 140 and then returns. This invention envisions that the computational
capacity of digital signal processor cores will increase regularly with time. Thus the
processing requirements of a particular product may require the combination of digital signal
processor core 110 and reconfigurable IPP hardware co-processor 140 at one point in time. At
a later point in time, the available computational capacity of an instruction set digital signal
processor core may increase so that the functions previously requiring a reconfigurable IPP
hardware co-processor may be performed in software by the digital signal processor core. The
prior program code for the product may be easily converted to the new, more powerful digital
signal processor. This is achieved by providing independent subroutines for each of the
commands supported by the replaced reconfigurable hardware co-processor. Then each place
where the original program employs the subroutine stub to transmit a command is replaced by
the corresponding subroutine call. Extensive reprogramming is thus avoided.
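
The stub idea can be sketched in software as follows. This is only an illustration under stated assumptions: COMMAND_PORT is a hypothetical stand-in for the predetermined memory-mapped command address, the tuple command encoding is invented, and the software replacement takes an explicit memory argument that the stub does not need.

```python
COMMAND_PORT = []  # hypothetical stand-in for the memory-mapped command address


def vector_add_stub(length, data_addr1, data_addr2, outp_addr):
    """Co-processor build: form the command from the parameters and post it."""
    COMMAND_PORT.append(("Vector_add", length, data_addr1, data_addr2, outp_addr))


def vector_add_software(memory, length, data_addr1, data_addr2, outp_addr):
    """Later build: same parameters, but the DSP performs the addition itself."""
    for i in range(length):
        memory[outp_addr + i] = memory[data_addr1 + i] + memory[data_addr2 + i]


# The call site keeps the same parameter list in either build.
memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
vector_add_stub(3, 0, 3, 6)
vector_add_software(memory, 3, 0, 3, 6)
```

Because both routines accept the same command parameters, replacing one with the other leaves the calling program largely untouched.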

Following completion of processing on one block of data, the data may be transferred
out of data memory 145. This second transfer can take place either by direct action of digital
signal processor core 110 reading the data stored at the output memory locations or through the
aid of direct memory access circuit 120. This output data may represent the output of the
process. In this event, the data is transferred to a utilization device. Alternatively, the output
data of reconfigurable IPP hardware co-processor 140 may represent work in progress. In this
case, the data will typically be temporarily stored in memory external to integrated circuit 100
for later retrieval and further processing.

Reconfigurable IPP hardware co-processor 140 is then ready for further use. This
further use may be additional processing of the same function. In this case, the process
described above is repeated on a new block of data in the same way. This further use may be
processing of another function. In this case, the new block of data must be loaded into
memory accessible by reconfigurable IPP hardware co-processor 140, the new command
loaded and then the processed data read for output or further processing.

Reconfigurable IPP hardware co-processor 140 preferably will be able to perform more
than one function of the product algorithm. The advantage of operating on blocks of data
rather than discrete samples will be evident when reconfigurable IPP hardware co-processor
140 operates in such a system. As an example, suppose that reconfigurable IPP hardware co-
processor 140 performs three functions, A, B and C. These functions may be sequential or
they may be interleaved with functions performed by digital signal processor core 110.
Reconfigurable IPP hardware co-processor 140 first performs function A on a block of data.
This function is performed as outlined above. Digital signal processor core 110 either directly
or by control of direct memory access circuit 120 loads the input data into data memory 145.
Upon issue of the command for configuration for function A, which specifies the amount of
data to be processed, reconfigurable IPP hardware co-processor 140 performs function A and
stores the resultant data back into the portion of memory 145 specified by the command. A
similar process occurs to cause reconfigurable IPP hardware co-processor 140 to perform
function B on data stored in memory 145 and return the result to memory 145. The
performance of function A may take place upon data blocks having a size unrelated to the size
of the data blocks for function B. Finally, reconfigurable IPP hardware co-processor 140 is
commanded to perform function C on data within memory 145, returning the resultant to
memory 145. The block size for performing function C is independent of the block sizes
selected for functions A and B.

The usefulness of the block processing is seen from this example. The three functions
A, B and C will typically perform amounts of work related to one common data processing
size (for example, one 16 x 16 block of pixels as a final output), that is not necessarily equal in
actual input/output sizes due to filter history and up/down sampling among functions.
Provision of special hardware for each function will sacrifice the generality of functionality and
reusability of reconfigurable hardware. Further, it would be difficult to match the resources
granted to each function in hardware to provide a balance and the best utilization of the
hardware. When reconfigurable hardware is used there is inevitably an overhead cost for
switching between configurations. Operating on a sample by sample basis for flow through the
three functions would require a maximum number of such reconfiguration switches. This
scenario would clearly be less than optimal. Thus operating each function on a block of data
before reconfiguration to switch between functions would reduce this overhead. Additionally,
it would then be relatively easy to allocate resources between the functions by selecting the
amount of time devoted to each function. Lastly, such block processing would generally
require less control overhead from the digital signal processor core than switching between
functions at a sample level.

The block sizes selected for the various functions A, B and C will depend upon the
relative data rates required and the data sizes. In addition, the tasks assigned to digital signal
processor core 110 and their respective computational requirements must also be considered.
Ideally, both digital signal processor core 110 and reconfigurable IPP hardware co-processor
140 would be nearly fully loaded. This would result in optimum use of the resources. The
amount of work that should be assigned to the IPP depends on the speedup factor of the IPP
co-processor 140 versus the DSP core 110. For example, when the IPP is 4 times faster than
the DSP, the optimum workload is to assign 80% of the work to the IPP, and 20% to the DSP
to accomplish 5 times the total speedup. Such balanced loading may only be achieved with
product algorithms with fixed and known functions and a stable data rate. This should be the
case for most imaging and video applications. If the computational load is expected to change
with time, then it will probably be best to dynamically allocate computational resources
between digital signal processor core 110 and reconfigurable IPP hardware co-processor 140.
In this case it is best to keep the functions performed by reconfigurable IPP hardware co-
processor 140 relatively stable and only the functions performed by digital signal processor
core 110 would vary.
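
The 80%/20% figure can be checked with a short calculation. The sketch below assumes a simple model that the text does not spell out: the DSP and the IPP work concurrently on disjoint shares of the job, the IPP is s times faster than the DSP, and the job finishes when the slower of the two finishes. The function name is hypothetical.

```python
from fractions import Fraction


def balanced_split(speedup):
    """Return (IPP share of the work, overall speedup) when both finish together.

    With IPP share f, the IPP takes f / s and the DSP takes 1 - f (in units of
    the DSP-only run time). Setting f / s = 1 - f gives f = s / (s + 1).
    """
    s = Fraction(speedup)
    ipp_share = s / (s + 1)
    elapsed = max(ipp_share / s, 1 - ipp_share)  # both terms are equal here
    return ipp_share, 1 / elapsed


share, total_speedup = balanced_split(4)
```

For a speedup factor of 4 this gives an IPP share of 4/5 and a total speedup of 5, matching the example in the text; under the same model the total speedup is always s + 1.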

The command set of reconfigurable IPP hardware co-processor 140 preferably includes
several non-computational instructions for control functions:

Receive_data_synchronization(signal, true/false), or wait_until_signal
Send_data_synchronization(signal, true/false), or assert_signal
Synchronization_completion(signal, true/false), or assert_signal
Call_subroutine(subroutine_addr)
Return()
Reset()
Sleep()
Write_parameter(parameter, value)

These control functions will be useful in cooperation between digital signal processor core 110
and reconfigurable IPP hardware co-processor 140. The first of these commands is a
receive_data_synchronization command. This command can also be called a wait_until_signal
command. This command will typically be used in conjunction with data transfers handled by
direct memory access circuit 120. Digital signal processor core 110 will control the process by
setting up the input data transfer through direct memory access circuit 120. Digital signal
processor core 110 will send two commands to reconfigurable IPP hardware co-processor 140.
The first command is the receive data synchronization command and the second command is
the computational command desired.

Reconfigurable IPP hardware co-processor 140 operates on commands stored in the
command queue 141 on a first-in-first-out basis. Upon reaching the receive data
synchronization command, reconfigurable IPP hardware co-processor will stop.
Reconfigurable IPP hardware co-processor will remain idle until it receives the indicated
control signal from direct memory access circuit 120 indicating completion of the input data
transfer. Note that direct memory access circuit 120 may be able to handle plural queued data
transfers. This is known in the art as plural DMA channels. In this case, the receive data
synchronization command must specify the hardware signal corresponding to the DMA channel
used for input data transfer.

Following the completed receive data synchronization command, reconfigurable IPP
hardware co-processor 140 advances to the next command in Command Queue 141. In this
case, this next command is a computational command using the data just loaded. Since this
computational command cannot start until the previous receive data synchronization command
completes, this assures that the correct data has been loaded.
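
The blocking behavior of the receive data synchronization command can be modelled with a small sketch. The queue encoding, the signal name dma_ch0, and the function step_queue are assumptions made for illustration; the log simply records which computational commands would have been executed.

```python
def step_queue(queue, asserted_signals, log):
    """Run commands from the front of the queue until it empties or blocks.

    Returns False when stopped at a wait_until_signal whose signal has not
    been asserted yet, True when the queue has been drained.
    """
    while queue:
        name, arg = queue[0]
        if name == "wait_until_signal" and arg not in asserted_signals:
            return False  # remain idle: the input transfer is not finished
        queue.pop(0)
        if name != "wait_until_signal":
            log.append(name)  # stand-in for performing the computation
    return True


queue = [("wait_until_signal", "dma_ch0"), ("vector_add", None)]
signals, log = set(), []
blocked = not step_queue(queue, signals, log)  # DMA still running
signals.add("dma_ch0")                         # DMA signals completion
drained = step_queue(queue, signals, log)
```

The computational command is never reached while the signal is absent, which is the ordering guarantee described above.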

The combination of the receive data synchronization command and the computational
command reduces the control burden on digital signal processor core 110. Digital signal
processor core 110 need only set up direct memory access circuit 120 to make the input data
transfer and send the pair of commands to reconfigurable IPP hardware co-processor 140.
This would assure that the input data transfer had completed prior to beginning the
computational operation. This greatly reduces the amount of software overhead required by
the digital signal processor core 110 to control the function of reconfigurable IPP hardware co-
processor 140. Otherwise, digital signal processor core 110 may need to receive an interrupt
from direct memory access circuit 120 signaling the completion of the input data load
operation. An interrupt service routine must be initiated to service the interrupt. In addition,
such an interrupt would require a context switch from the interrupted process to the interrupt
service routine, and another context switch to return from the interrupt. Consequently, the
receive data synchronization command frees up considerable capacity within digital signal
processor core for more productive use.

Another non-computational command is a send data synchronization command. The
send data synchronization command is nearly the inverse of the receive data synchronization
command, and actually asserts the signal specified. Upon reaching the send data
synchronization command, reconfigurable IPP hardware co-processor 140 asserts a signal
which then triggers a direct memory access operation. This direct memory access operation
reads data from data memory 145 for storage at another system location. This direct memory
access operation may be preset by digital signal processor core 110 and is merely begun upon
receipt of a signal from reconfigurable IPP hardware co-processor 140 upon encountering the
send data synchronization command. In the case in which direct memory access circuit 120
supports plural DMA channels, the send data synchronization command must specify the
hardware signal that would trigger the correct DMA channel for the output data transfer.
Alternatively, the send data synchronization command may specify the control parameters for
direct memory access circuit 120, including the DMA channel if more than one channel is
supported. Upon encountering such a send data synchronization command, reconfigurable IPP
hardware co-processor 140 communicates directly with direct memory access circuit 120 to set
up and start an appropriate direct memory access operation.

Another possible non-computational command is a synchronization completion
command, actually another application of the assert_signal command. Upon encountering a
synchronization completion command, reconfigurable IPP hardware co-processor 140 sends a
signal to digital signal processor core 110. Upon receiving such a signal, digital signal
processor core 110 is assured that all prior commands sent to reconfigurable IPP hardware co-
processor 140 have completed. Depending upon the application, it may be better to sense this
signal via interrupt or by DSP core 110 polling a hardware status register. It may also be
better to queue several operations for reconfigurable IPP hardware co-processor 140 using
send and receive data synchronization commands and then interrupt digital signal processor
core 110 at the end of the queue. This may be useful for higher level control functions by
digital signal processor core 110 following the queued operations by reconfigurable IPP
hardware co-processor 140. The IPP also uses the following other control/synchronization
commands: Sleep; Reset; Write_parameter. The write_parameter command is used to
perform parameter updates. Parameters that are changed frequently can be incorporated into
commands to be specified on each task. Parameters, such as output right shift, additional term
for rounding, saturation low/high bounds, saturation low/high set values, and operand
size (8/16 bit), that are not often changed can be updated using the write_parameter command.
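
To make the listed parameters concrete, the following sketch shows one plausible way they could act on an accumulated result. The order of operations (add the rounding term, shift right, then saturate) and the function name are assumptions; the text only names the parameters and does not define how they are combined.

```python
def postprocess(acc, right_shift, round_term, low_bound, high_bound, low_set, high_set):
    """Apply rounding, right shift, and saturation to an accumulator value.

    Assumed order: (acc + round_term) >> right_shift, then values outside
    [low_bound, high_bound] are replaced by low_set or high_set.
    """
    value = (acc + round_term) >> right_shift
    if value < low_bound:
        return low_set
    if value > high_bound:
        return high_set
    return value


# Example: scale by 1/16 with round-to-nearest and clamp to an 8-bit pixel range.
in_range = postprocess(1000, 4, 8, 0, 255, 0, 255)
too_high = postprocess(5000, 4, 8, 0, 255, 0, 255)
too_low = postprocess(-100, 4, 8, 0, 255, 0, 255)
```

Since such values rarely change between tasks, setting them once with write_parameter keeps them out of every computational command.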
The reconfigurable IPP hardware co-processor supports the following computational
commands directly:

- Row/column 8-point DCT/IDCT
- Vector addition/subtraction/multiplication
- Scalar-vector addition/subtraction/multiplication
- Table lookup
- Sum of absolute differences

In addition, through extension and special-casing of the above generic computational
commands, the IPP also supports:

- 2-D DCT/IDCT
- demosaicing by simple interpolation
- chroma subsampling
- wavelets analysis and reconstruction
- color suppression
- color conversion
- memory-to-memory moves
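
As a point of reference for the sum of absolute differences command, a plain software equivalent might look like the sketch below. It models memory as a flat list and mirrors the parameter names of the Sum_abs_diff command listed earlier; it does not attempt to model the parallel hardware, and returning the sum instead of writing it to an output address is a simplification.

```python
def sum_abs_diff(memory, length, data_addr1, data_addr2):
    """Sum of |a[i] - b[i]| over two blocks of `length` samples in `memory`."""
    return sum(
        abs(memory[data_addr1 + i] - memory[data_addr2 + i])
        for i in range(length)
    )


# Two 3-sample blocks stored back to back: [1, 5, 9] and [4, 2, 9].
memory = [1, 5, 9, 4, 2, 9]
sad = sum_abs_diff(memory, 3, 0, 3)
```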

Each command will include pointers for relevant data and coefficient storage (input data)
as well as addresses for output result data. Additionally, the number of filter taps, up/down
sampling factors, the number of outputs produced, and various pointer increment options are
attached to the computational commands. Because image processing is the application area, 2-D
block processing is allowed whenever feasible.

Figure 4 illustrates another possible arrangement of circuit 100. Circuit 100 illustrated
in Figure 4 includes 2 reconfigurable IPP hardware co-processors, 140 and 180. Digital signal
processor core operates with first reconfigurable IPP hardware co-processor 140 and second
reconfigurable IPP hardware co-processor 180. A private bus 185 couples first reconfigurable
IPP hardware co-processor 140 to reconfigurable IPP hardware co-processor 180. These co-
processors have private memories sharing the memory space of digital signal processor core
110. The data can be transferred via private bus 185 by one co-processor writing to the
address range encompassed by the other co-processor's memory. Alternatively, each co-
processor may have an output port directed toward an input port of another co-processor with
the links between co-processors encompassed in private bus 185. This construction may be
particularly useful for products in which data flows from one type of operation handled by one
co-processor to another type of operation handled by the second co-processor. This private
bus frees digital signal processor 110 from having to handle the data handoff either directly or
via direct memory access circuit 120.

Alternatively, Figure 5 illustrates digital signal processor core 110 and a reconfigurable
IPP hardware co-processor 140 loosely connected together via system bus 142. Digital signal
processor core 110 may be of conventional design. In the preferred embodiment,
reconfigurable IPP hardware co-processor 140 is adapted to coordinate with direct memory
access circuit 120 for autonomous data transfers independent of digital signal processor core
110. The parts illustrated in Figure 5 are not intended to exclude the provision of other
conventional parts. The system level connection in Figure 5 may be useful when the digital
signal processor core 110 in a particular implementation does not offer connection to its
internal bus, for example when using catalog devices. Data transfer overhead is usually larger
when IPP co-processor 140 is attached to the system bus, yet there is more system level
flexibility, like using multiple DSPs or multiple IPPs in the same system, and relative ease of
changing or upgrading DSP and IPP.

As an example of the communication between the DSP and the IPP, if the DSP is
instructing the IPP to perform a vector addition task, these are the events that occur from the
DSP's point of view. The DSP sets up the DMA transfer to send data to the IPP. Then the
DSP sends a wait_until_signal command to the IPP (this signal will be asserted by the DMA
controller once the transfer is completed). Next the DSP sends a vector_add command to the
IPP, which frees up the DSP to perform other tasks. Now, either the DSP comes back to
check on the completion status of the IPP, or alternatively, the DSP can be interrupted upon
completion of the IPP task upon receipt of an assert_signal command, which would follow the
vector_add command. Finally, the DSP sets up the DMA to get the result back from the IPP.
As mentioned previously, as there is some overhead in managing each data transfer and each
computation command, the functionality of the IPP supports and encourages block
computations. Another advisable practice is to perform cascaded tasks on the IPP for the same
batches of data, to reduce data transfers, and thus reduce the DSP load as well as the system
bus load and overall power consumption.

The IPP supports one-dimensional, row-wise filtering when data is stored in rows.
Certain combinations of upsampling and downsampling are supported as well. For example,
the following 5 methods implement various up/down sampling options and constraints on filter
length. Only configurations A and D (Figures 8 and 12) are considered here; there are many
more methods in a fully reconfigurable IPP datapath (Figure 13).



Method                            Configuration      Filter taps (Util-1)   Up sampling factor   Downsampling factor
a) no up/down sampling            A (8 MACs)         Any                    1                    1
b) u/s up sample in space-time    A (8 MACs)         any                    8, 16, 24            Any
c) up sample in space             A (8 MACs)         any                    2, 4, 8              1
d) down sample in space           D (quad 2-trees)   Even                   1                    2
e) up sample in space-time        D (quad 2-trees)   even                   4, 8, 12             Any



Figures 6-15 illustrate the construction of an exemplary reconfigurable IPP hardware
co-processor with Figures 8 and 10-15 illustrating various datapath configurations. Figure 6
illustrates the overall block diagram general architecture of reconfigurable IPP hardware
co-processor 140 according to a preferred embodiment of the invention. On the host's memory
map, the IPP interface should appear as large contiguous memory blocks, for coefficients, data
and macro-commands, and also as discrete control/status registers, for configuration, command
queue, run-time control, etc. The configuration/command queue registers may very well sit on
the host's DSP external bus in either I/O or memory address space. Multiple write addresses
(with respect to the host) must be set up to modify less frequently changed parameters in the IPP
such as hardware handshake signaling, software reset, and so on. One write address, for
commands, links to an internal command queue. There are a few additional write addresses
for clearing interrupts, one for each interrupt. There is at least one read address for query of
command completion status.

The data portion should map into the host's memory space, if possible. If the address
space is insufficient, address and data ports should be separate, such that writing to the address
port sets up an initial address, and subsequent reads/writes to the data port transfer contiguous
data from/to the IPP data memory. In terms of IPP implementation, buffering is necessary
between the outside 16/32 bit bus and the internal memory's 128 bit width. A small cache can
be used for that purpose. A read-ahead technique for reading and writeback for writing can
reduce the access time. Around 512 bits in this buffer, half for read and half for write, should
be sufficient.
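
The separate address port and data port arrangement can be pictured with the following sketch. The class and method names are hypothetical, and the automatic increment of the address after each data access is an assumption implied by the phrase "transfer contiguous data"; the buffering between bus widths is not modelled.

```python
class PortedMemory:
    """Toy model of IPP data memory reached through an address and a data port."""

    def __init__(self, size):
        self.cells = [0] * size
        self.address = 0  # current location, set through the address port

    def write_address(self, address):
        """Address port: set the starting location for subsequent transfers."""
        self.address = address

    def write_data(self, value):
        """Data port write: store one word, then advance to the next location."""
        self.cells[self.address] = value
        self.address += 1

    def read_data(self):
        """Data port read: fetch one word, then advance to the next location."""
        value = self.cells[self.address]
        self.address += 1
        return value


mem = PortedMemory(8)
mem.write_address(2)
for sample in (10, 20, 30):
    mem.write_data(sample)  # lands at locations 2, 3, 4
mem.write_address(2)
readback = [mem.read_data() for _ in range(3)]
```

Only two host addresses are consumed in this arrangement, regardless of the size of the IPP data memory.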

Three logical memory blocks, data memory A and B and command memory, are
accessible from a system bus via an external bus interface. The memory interface handles
memory arbitration between the IPP 140 and the system bus 142, as well as simple First-In
First-Out (FIFO) control involved in matching the system bus access width with the memory
width. Data A and B are for input/output data and coefficients. Cascaded commands can
reuse areas in the data memory, so the terms input/output are in the context of a single
command. As previously mentioned, the Command Queue 141 can receive commands from
the digital signal processor 110 via the digital signal processor bus 142, and in supplying those
commands to the Execution Control unit 190, control the operation of the reconfigurable IPP
hardware co-processor 140. The control block steps through the desired memory access and
computation functions indicated by the command. Command memory 141 is read by the
decode unit 142. To conserve memory, variable length commands are incorporated. The
decode unit 142 sends the produced control parameters (one set per command) to the execution
control unit 190, which uses the control parameters to drive a pipelined control path to fan out
the control signals to the appropriate component. Control signals can be either fixed or time
varying in a command. They include memory access requests, input/output formatter control,
and datapath control.

Data memory 145 and coefficient memory 147 are wide memory blocks (128-bit each)
to support an 8-way parallel 16-bit datapath. This 128 bit wide memory block means the
datapath does not have to access memory every cycle. The Data Memory 145 receives relevant
input data via the DSP bus and also stores the resultant data subsequent to processing through
the Datapath core 170 and reformatting in the Output Formatter 180. Coefficient data can also
be received from the DSP bus 142, or possibly, provided in a Look-Up Table within the IPP
itself, and along with the input data, be processed through the Datapath core 170 and then
reformatted in the Output Formatter block 180. Data memory 145 and coefficient memory 147
may be written to in 128 bit words. This write operation is controlled by digital signal
processor core 110 or direct memory access circuit 120 which, through the use of operand
pointers in the commands, manage the two memory blocks. Address generator 150 generates
the addresses for recall of data and coefficients used by the co-processor. This read operation
operates on data words of 128 bits from each memory.

The recalled 128 bit data words from Data and Coefficient Memories are supplied to
input formatter 160. Input formatter 160 performs various shift and alignment operations
generally to arrange the 128 bit input data words into the order needed for the desired
computation. Input formatter outputs a 128 bit (8 by 16 bits) Data A, a 128 bit (8 by 16 bits)
Data B and a 128 bit (8 by 16 bits) Coeff Data.

These three data streams, Data A, Data B, and Coeff Data, are supplied to Datapath
170. Datapath 170 is the operational portion of the co-processor. The datapath can be
configured at run time to support a variety of image processing tasks. Figures 12 and 13
illustrate two preferred embodiments of the invention. Some tasks can be mapped into both
configurations, each providing a different pattern of input/output memory access. These
choices offer flexibility in the hands of application programmers to balance speed, data memory
and sometimes power requirements. As will be further described below, datapath 170 includes
plural hardware multipliers and adders that are connectable in various ways to perform a
variety of multiply-accumulate operations. Datapath 170 outputs three adder data streams.
Two of these three are 16 bit data words while one of the three is a 128 bit word (8 by 16 bits).
These three data streams supply the inputs to output formatter 180. Output formatter
180 rearranges the three data streams into eight 128 bit data words for writing back into the
memory. The addresses for these two write operations are computed by address generator
150. This rearrangement may take care of alignment on memory word boundaries.

The operations of the co-processor are under control of control unit 190. Control unit 190
recalls the commands from command queue 141 and provides the corresponding control within
co-processor 140.

The construction of input formatter 160 is illustrated in Figure 7. The two data streams
Data A and Data B of 128 bits each are supplied to an input of multiplexers 205 and 207.
Each multiplexer independently selects one input for storage in its corresponding register,
215 and 217 respectively. Multiplexer 205 may select either one of the input data streams or
recycle the contents of register 215. Multiplexer 201 may select either the contents of
register 215 or recycle the contents of its register 211. Multiplexer 207 may select either
the other of the input data streams, or recycle the contents of register 217. The lower bits
of shifter 221 are supplied from register 215. The upper bits of shifter 221 are supplied by
register 211. Shifter 221 shifts and selects all 256 of its input bits, and 128 bits are supplied
to one full/4 way 64b x 2-1 multiplexer 231 and 128 bits are supplied to full/1way/4way 128b
x 3-1 multiplexer 235. The 128 bit output of multiplexer 231 is stored temporarily in register
241 and forms the Data A input to datapath 170. The 128 bit output of multiplexer 235 is
stored temporarily in register 245 and forms the Data B input to datapath 170. The output of
multiplexer 207 is supplied directly to a full/1w/2w/4w 128b x 4-1 multiplexer 237 as well as
supplied to register 217. Multiplexer 237 selects the entire 128 bits supplied from register 217
and stores the result in register 247. This result forms the coefficient data input to datapath
170.

As mentioned previously, the three data streams, Data A, Data B, and Coeff Data, are
supplied to Datapath 170 for processing. Figure 8 illustrates a Datapath architecture according
to a first preferred embodiment of the invention, in which eight Multiply Accumulate Units
(MACs) are connected in parallel ("A" configuration). The multiply-accumulate operation,
where the sum of plural products is formed, is widely used in signal processing, for example,
in many filter algorithms. N multiply accumulate units (where N = 8 in this example) are
operated in parallel to compute N output points. This configuration is suitable for a wide
memory word that contains multiple pixels, typical for image processing. The feedback loop
on the final row of adders contains multiple banks of accumulators to support upsampling.
According to a preferred embodiment, each MAC is associated with 3 accumulators, and
Control Unit 190 includes the necessary addressing mechanism for these accumulators. An
accumulator depth of three is chosen in order to support color conversion, which involves 3 x
3 matrixing. Thus, an accumulator depth of three simplifies implementation for color
conversion.
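
The role of the three accumulator banks in 3 x 3 color conversion can be sketched in software. The following is a minimal illustration only: the function name, the plane-by-plane schedule and the use of plain Python integers are assumptions made for this example and are not taken from the figures.

```python
def color_convert_block(planes, matrix, lanes=8):
    """Apply a 3 x 3 color matrix to `lanes` pixels held as three input planes.

    Hypothetical model: each of the 8 MAC lanes owns 3 accumulators, one per
    output color component. The schedule below is an assumed one.
    """
    acc = [[0] * lanes for _ in range(3)]    # 3 accumulator banks per lane
    for j, plane in enumerate(planes):       # one input color component at a time
        for i in range(3):                   # select accumulator bank i
            coeff = matrix[i][j]             # one coefficient broadcast to all lanes
            for lane in range(lanes):
                acc[i][lane] += coeff * plane[lane]
    return acc
```

With this schedule, nine multiply-accumulate passes over an 8-pixel word yield all three output components without spilling partial sums to memory, which is one plausible reading of why a depth of three is convenient.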

Figure 9 illustrates the construction of the output formatter 180 illustrated in Figure 6.
The 16 bit dataword outputs of the first and second accumulators within reconfigurable IPP
hardware co-processor 140 (Acc[0] and Acc[1]) form the first two inputs to the output
formatter 180, with the outputs of all 8 accumulators of reconfigurable IPP hardware co-
processor 140 (Acc[0], Acc[1], Acc[2], Acc[3], Acc[4], Acc[5], Acc[6], Acc[7]) providing the
third input to the output formatter. Eight 16 bit blocks are written to data memory 145
subsequent to processing through the multiplexers and registers of output formatter 180.

Figure 10 illustrates the construction of datapath 170 according to a second preferred
embodiment illustrating a single 8-tree adder configuration ("B" configuration). Various
segments of the Data A and Data B 128 bit (8 x 16 bit) dataword inputs to the datapath 170,
supplied from input formatter 160, are supplied to adders/subtractors (adders) 310, 320, 330,
340, 350, 360, 370 and 380. As shown, the first 16 bit datawords, Data A[0] and Data B[0],
which represent the left most or most significant bits of the 128 bit output, are coupled to adder
310 and adder 320, the second 16 bit datawords, Data A[1] and Data B[1], are coupled to adder
330 and adder 340, the third 16 bit datawords, Data A[2] and Data B[2], are coupled to adder
350 and adder 360, and the fourth 16 bit datawords, Data A[3] and Data B[3], are coupled to
adder 370 and adder 380. The result of this addition or subtraction of the first through fourth
16 bit datawords is stored in pipeline registers 312, 322, 332, 342, 352, 362, 372 and
382. This result is then multiplied by the Coeff Data, which for this configuration of IPP
consists of the same two 16 bit datawords. In other words, with the 8 MAC configuration
shown in Figure 10, 4 data words and two coefficient words are fed to the hardware on each
cycle. These same two coefficient words are used in every pair of adders to multiply the input
data point with, and the products, which are stored in pipeline registers 316, 326, 336, 346,
356, 366, 376 and 386, are summed in adders 318, 338, 358 and 378. The results of those
summations are summed in adders 328 and 368, the summations of which are added in adder
348. The output of adder 348 is accumulated in accumulator 349. The benefit of this
configuration is that, although 8 multipliers are used, only one accumulator is required to
process the two 128 bit word outputs of input formatter 160.

Figure 11 illustrates the construction of datapath 170 according to a third preferred
embodiment illustrating a dual 4-tree with butterfly adder configuration ("C" configuration).
Various segments of the Data A and Data B 128 bit (8 x 16 bit) dataword inputs to the datapath
170, supplied from input formatter 160, are supplied to adders/subtractors (adders) 310, 320,
330, 340, 350, 360, 370 and 380. As shown, the first 16 bit datawords, Data A[0] and Data
B[0], which represent the left most or most significant bits of the 128 bit output, are coupled to
adder 310, the second 16 bit datawords, Data A[1] and Data B[1], are coupled to adder 320, the
third 16 bit datawords, Data A[2] and Data B[2], are coupled to adder 330, the fourth 16 bit
datawords, Data A[3] and Data B[3], are coupled to adder 340, the fifth 16 bit datawords, Data
A[4] and Data B[4], are coupled to adder 350, the sixth 16 bit datawords, Data A[5] and
Data B[5], are coupled to adder 360, the seventh 16 bit datawords, Data A[6] and Data B[6], are
coupled to adder 370 and the eighth 16 bit datawords, or the least significant bits of the 128 bit
output of input formatter 160, Data A[7] and Data B[7], are coupled to adder 380. The result
of this addition or subtraction of the first through eighth 16 bit datawords is stored in
pipeline registers 312, 322, 332, 342, 352, 362, 372 and 382. This result is then multiplied by
the Coeff Data, which for this configuration of IPP consists of two 16 bit words. In other
words, with the 2 MAC configuration shown in Figure 11, 8 datawords and two coefficient
words are fed to the hardware on each cycle. These same two coefficient words are used in
every adder/multiplier portion of each MAC unit to multiply the input data point with, and the
products, which are stored in pipeline registers 316, 326, 336, 346, 356, 366, 376 and 386,
are summed in adders 318, 338, 358 and 378. The results of those summations are summed in
adders 328 and 368. The summation from adder 328 is then subtracted from the summation
from adder 368 in subtractor 388. The output from 388 is then accumulated in accumulator
359. The summation from adder 368 is then added to the summation from adder 328 in adder
348. The output of adder 348 is then accumulated in accumulator 349. The benefit of this
configuration is that, although 8 multipliers are used, only two accumulators are required to
process the two 128 bit word outputs of input formatter 160.

Figure 12 illustrates the construction of datapath 170 according to a fourth preferred
embodiment wherein a quad 2-tree adder configuration is illustrated ("D" configuration).
Various segments of the Data A and Data B 128 bit (8 x 16 bit) dataword inputs to the datapath
170, supplied from input formatter 160, are supplied to adders/subtractors (adders) 310, 320,
330, 340, 350, 360, 370 and 380. Two different input data schemes are envisioned. The first
scheme provides 8 datawords and 2 coefficient words to the hardware each cycle.
Downsampling of 2x is performed with the filtering. Each pair of MAC units performs two
multiplications and accumulates the sum of the products. The second scheme provides 2
data words and 8 coefficient words to the hardware each cycle. Again, each pair of MAC units
performs two multiplications, an addition and an accumulation. Upsampling is performed with
the 4-way parallelism and optionally with the depth of each accumulator.

According to the first scheme, the first 16 bit datawords, Data A[0] and Data B[0],
which represent the left most or most significant bits of the 128 bit output, are coupled to adder
310, the second 16 bit datawords, Data A[1] and Data B[1], are coupled to adder 320, the third
16 bit datawords, Data A[2] and Data B[2], are coupled to adder 330, the fourth 16 bit
datawords, Data A[3] and Data B[3], are coupled to adder 340, the fifth 16 bit datawords, Data
A[4] and Data B[4], are coupled to adder 350, the sixth 16 bit datawords, Data A[5] and
Data B[5], are coupled to adder 360, the seventh 16 bit datawords, Data A[6] and Data B[6], are
coupled to adder 370 and the eighth 16 bit datawords, Data A[7] and Data B[7], are coupled to
adder 380. The result of this addition or subtraction of the first through eighth 16 bit
datawords is stored in pipeline registers 312, 322, 332, 342, 352, 362, 372 and 382. This
result is then multiplied by the Coeff Data, which for this configuration of IPP consists of two
16 bit coefficient words. In other words, with the quad 2-tree adder configuration shown in
Figure 12, 8 datawords and two coefficient words are fed to the hardware on each cycle. The
same two coefficient words are used in every pair of MAC units to multiply the input data
point with, and the products, which are stored in pipeline registers 316, 326, 336, 346, 356,
366, 376 and 386, are summed in adders 318, 338, 358 and 378. The summations from adders
318, 338, 358 and 378 are then accumulated in accumulators 319, 339, 359 and 379. The
benefit of this configuration is that, although 8 multipliers are used, only four accumulators
are required to process the two 128 bit word outputs of input formatter 160.

Figure 13 illustrates the construction of datapath 170 that includes routing and
multiplexing necessary to support the 4 configurations, A, B, C, and D (Figures 8, 10, 11, and
12). Various segments of the Data A and Data B 128 bit (8 x 16 bit) dataword inputs to the
datapath 170, supplied from input formatter 160, are supplied to adders/subtractors (adders)
310, 320, 330, 340, 350, 360, 370 and 380. As shown, the first 16 bit datawords, Data A[0]
and Data B[0], which represent the left most or most significant bits of the 128 bit output, are
coupled to adder 310, the second 16 bit datawords, Data A[1] and Data B[1], are coupled to
adder 320, the third 16 bit datawords, Data A[2] and Data B[2], are coupled to adder 330, the
fourth 16 bit datawords, Data A[3] and Data B[3], are coupled to adder 340, the fifth 16 bit
datawords, Data A[4] and Data B[4], are coupled to adder 350, the sixth 16 bit datawords, Data
A[5] and Data B[5], are coupled to adder 360, the seventh 16 bit datawords, Data A[6] and Data
B[6], are coupled to adder 370 and the eighth 16 bit datawords, Data A[7] and Data B[7], are
coupled to adder 380. The result of this addition or subtraction of the first through
eighth 16 bit datawords is stored in pipeline registers 312, 322, 332, 342, 352, 362, 372 and 382.
This result is then multiplied by the Coeff Data, which for this configuration of IPP consists
of the same 16 bit dataword. In other words, with the 8 MAC configuration shown in Figures
8 and 13, 8 datawords and one coefficient dataword are fed to the hardware on each cycle.
This same coefficient dataword is used in every MAC unit to multiply the input data point
with, and the products, which are stored in pipeline registers 316, 326, 336, 346, 356, 366,
376 and 386, are accumulated in adders 318, 328, 338, 348, 358, 368, 378 and 388.

Actually, as shown in the routing and multiplexing for configurations A/B/C/D diagram
of Figure 13, the products form one input to adders 318 through 388. The second input to
adder 318 is formed by the output of multiplexer 319, which has two inputs: the first being the
product from the multiplier 324 and the second being the accumulated sum of adder 318.
Adder 328 has multiplexers 325 and 329 on both inputs. Multiplexer 325 selects between
multiplier 324 or the output of adder 318. Multiplexer 329 selects between the accumulated
result from adder 328 itself, or from the next adder 338. In the 8 MACs configuration (A, Figure
8), the pair of adders 318 and 328 implement separate accumulation of products from
multipliers 314 and 324. In the quad 2-trees configuration (D, Figure 12), the pair of adders
318 and 328 implement summation of the products (by 318) then accumulating the sums (by
328).

Similarly, the adder pair 338 and 348, the adder pair 358 and 368, and the adder pair
378 and 388 each implement either separate accumulation of products or accumulation of sums
of 2 products. In the case of the summed-up accumulation supporting the quad 2-trees
configuration, adders 348, 368, and 388 produce the final accumulated outputs, just like adder 328.

To support the dual 4-tree with butterfly configuration (C), multiplexers 319, 339, 359,
and 379 are selected such that adders 318, 338, 358, and 378 sum up neighboring pairs of
products from the 8 multipliers. Multiplexers 325 and 329 are selected such that adder 328
adds up results of adders 318 and 338, and thus has the sum from the first 4 multipliers 314,
324, 334, and 344. Multiplexers 365 and 369 are similarly selected so that adder 368 has the
sum from the last 4 multipliers 354, 364, 374 and 384. These 2 sums, at adders 328 and 368,
are then routed to both adders 348 and 390, which implement the cross add/subtract
operations. Adder 348 performs the addition, and adder 390 performs the subtraction. Results
from adders 348 and 390 are next routed to adders 388 and 392, respectively, for
accumulation. Adders 388 and 392 produce the final pair of outputs.

To support the single 8-tree configuration (B), all multiplexer configuration for the dual 4-
tree with butterfly configuration (C) is retained. Adder 348 has the sum from all 8 multipliers,
and adder 388 has the accumulated result. Output of adder 392 is simply ignored.
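
The way the four configurations combine the eight multiplier outputs before accumulation can be summarized in a short behavioral sketch. This is a software model only, not the multiplexer network of Figure 13; the function name is hypothetical, and the direction of the subtraction in configuration C (last four minus first four) is an assumption based on the description of Figure 11.

```python
def reduce_products(p, config):
    """Return the values handed to the accumulators for one cycle.

    p is the list of 8 multiplier outputs; config is 'A', 'B', 'C' or 'D'.
    """
    if len(p) != 8:
        raise ValueError("expected 8 products")
    if config == "A":                          # 8 parallel MACs: no cross sums
        return list(p)
    pairs = [p[i] + p[i + 1] for i in range(0, 8, 2)]
    if config == "D":                          # quad 2-tree: 4 pair sums
        return pairs
    first4 = pairs[0] + pairs[1]               # sum of first 4 products
    last4 = pairs[2] + pairs[3]                # sum of last 4 products
    if config == "C":                          # dual 4-tree with butterfly
        return [first4 + last4, last4 - first4]
    if config == "B":                          # single 8-tree: butterfly sum only
        return [first4 + last4]
    raise ValueError("unknown configuration")
```

Configuration B reuses the C routing and simply discards the difference output, which mirrors the statement that the output of adder 392 is ignored.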

Figure 14 illustrates a simplified version of the reconfigurable datapath architecture. This
simplified architecture supports both the parallel MACs of Figure 8 and the quad 2-trees of
Figure 12. As is shown, instead of the separate adders and multipliers illustrated in Figures 8
and 13, both Data A and Data B inputs are applied to both a multiplier and an adder/subtractor
(adder) and then the outputs of either the adders or multipliers are selected before going out of
the multiply/add/subtract blocks 810, 820, 830, 840, 850, 860, 870, 880. A more in depth
illustration of a pair of the MAC units of Figure 14 is shown in Figure 28. Each MAC unit is
capable of performing a pipelined single cycle multiply accumulate operation on two inputs
D_inp and C_inp. Accumulation of D_inp + C_inp or D_inp - C_inp instead of D_inp *
C_inp is also possible, hence the add/subtract unit 310 placed in parallel with each multiplier
314; the multiplexer 610 chooses between the adder/subtractor 310 output or the multiplier
314 output. Between each pair of MAC units, there is also the quad 2-trees option (indicated
by the AND gate 710) to add up the pair of results (D_inp */+/- C_inp), to produce ACC_inp,
which feeds the accumulating adder 818.

As shown in Figure 14, both of the above described configurations are implemented.
Although only 8 adders (excluding those in parallel with multipliers) are active at any given
time, 12 physical adders are used in this design, in order to reduce the cost of multiplexing and
routing. The AND gates 710, 720, 730 and 740 on the cross path control whether or not the
*/+/- results should be added together. As shown in Figure 28, three accumulators 612, 614
and 616 are available in each MAC unit to implement upsampling. The accumulator 818 can
select, via multiplexer 618, any of the three as input (with the other input being ACC_inp), or
from the half-unit quantity for rounding, RND_ADD. On the very first cycle of valid data on
ACC_inp, RND_ADD should be the selected input.
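
Seeding the accumulation with the half-unit quantity means that a final right shift rounds to nearest instead of truncating. A minimal sketch of that idea follows; the function name, the shift amount parameter and the 16 bit signed saturation limits are assumptions chosen for illustration, not values given in the description.

```python
def accumulate_rounded(values, shift, lo=-32768, hi=32767):
    """Accumulate values, then round by right shift and saturate.

    The accumulator starts at half a unit of the final scale (the role played
    by RND_ADD on the first valid cycle), so the shift rounds to nearest.
    """
    acc = (1 << (shift - 1)) if shift > 0 else 0   # half-unit seed
    for v in values:
        acc += v                                   # ordinary accumulation
    result = acc >> shift                          # rounding is just a shift
    return max(lo, min(hi, result))                # saturate to assumed limits
```

For example, a sum of 3 with a shift of 1 represents 1.5 and comes out as 2, while a sum of 5 with a shift of 2 represents 1.25 and comes out as 1.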

Rounding and saturation follow the main arithmetic datapath. With the half-unit
quantity already added to the accumulated sum, rounding is simply a right shift.

Figure 15 illustrates a more simplified version of Figure 8 than that illustrated in Figure 14.
The configuration illustrated in Figure 15 comprises only 4 MAC units versus 8 MAC units
illustrated in previous configurations and does not contain the pre-add illustrated in Figures 8-
14. As illustrated in Figures 14 and 28, Figure 15 illustrates Data A and Data B inputs applied
to both a multiplier 314 and an adder/subtractor (adder) 310 and then the outputs of the adders
and multipliers are multiplexed together in multiplexers 610 and 620 (Figure 28). Because
there is no pre-add, post multiplexing, the outputs of the multiplexers 610 and 620 are
accumulated in accumulators 818, 828, 838 and 848. As previously described with reference
to Figure 14, and as shown in Figure 28, three accumulators 612, 614 and 616 are available in
each MAC unit to implement upsampling. The accumulator 818 can select, via multiplexer
618, any of the three as input (with the other input being ACC_inp), or from the half-unit
quantity for rounding, RND_ADD. On the very first cycle of valid data on ACC_inp,
RND_ADD should be the selected input.

In Figures 14 and 15, it is sometimes desirable to add an absolute difference operation to
the multiply/add/subtract block. This will speed up the motion estimation task in video encoding
applications.

Figure 16 illustrates the input data formatting necessary to perform the IPP operation of
row filtering. On the first cycle, the Data A input to all 8 MACs comprises the first 8 data
words. Every cycle, the window of input data words used to feed the MACs is shifted one
word to the right. The Data B input of all 8 MACs is fed the same coefficient word. In this
example, a 3-tap FIR filter is implemented, so three coefficient words are provided.

In the figure, X0 ... X7 comprise the first Data A input to the MACs during a first clock
cycle. Shifting by one data word, the second Data A input becomes X1 ... X8 during a second
clock cycle. The Data A inputs continue in this manner, supplying each MAC with a
consecutive sequence of data words. The first filter coefficient C0 is broadcast to all MACs for
the first cycle, C1 is broadcast to all MACs for the second cycle, and C2 for the third cycle.
At the third cycle, the MAC units have accumulated the correct outputs and can write back
results to data memory. The data feed continues at X8 ... X15 to begin to compute output
Y8 ... Y15, and the coefficient feed wraps back to C0.
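
The sliding-window schedule described for row filtering can be modeled in a few lines. This is a behavioral sketch under the assumption that the input list already holds enough samples past the block; the function name and the `start` parameter are illustrative only.

```python
def row_filter_block(x, coeffs, start, lanes=8):
    """Compute `lanes` consecutive FIR outputs beginning at index `start`.

    One loop iteration corresponds to one cycle: the data window slides one
    word to the right and a single coefficient is broadcast to every MAC.
    """
    acc = [0] * lanes
    for k, c in enumerate(coeffs):                  # cycle k uses coefficient C_k
        window = x[start + k:start + k + lanes]     # X[start+k] ... X[start+k+7]
        for lane in range(lanes):
            acc[lane] += c * window[lane]
    return acc
```

Calling the function with `start=0` and then `start=8` reproduces the two blocks Y0 ... Y7 and Y8 ... Y15 mentioned in the text, with the coefficient sequence restarting at C0 for each block.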

Maintaining the same configuration, an alternative output is rendered when, instead of
supplying 8 data words and one coefficient word to the hardware, one data word and
8 coefficient words are provided for the 8 filter banks. Again each MAC is working
independently, multiplying the same data word with its specific coefficient word and
accumulating the products. Upsampling is performed with the 8-way parallelism and optionally
with the depth of each accumulator.

Figure 17 illustrates the input data formatting necessary to perform a
symmetric row filtering operation. In this example IPP implements a 3-tap filter, so the first
and third coefficients are equivalent. Therefore, only two coefficient words are provided. On
the first cycle, the Data A input comprises the first 8 data words X0 ... X7. The first Data B
input comprises data words X2 ... X9. In addition, the first coefficient supplied to all the
multipliers is C0. The second Data A input is the first Data A input shifted to the right one
word, or X1 ... X8. The second Data B input is the same 8 data words. Coefficient C1 is
supplied to all the multipliers on the second cycle. Effectively, IPP computes
C0*(X0 + X2) + 2*C1*X1 on the first MAC,
C0*(X1 + X3) + 2*C1*X2 on the second MAC,
and so on. Let the desired filter coefficients be F0, F1, F2, where F0 = F2. The supplied
coefficients should relate to the desired coefficients by

C0 = F0

C1 = 0.5 * F1

At the end of the second cycle, the 3-tap filter outputs are ready to be stored back to data
memory. On the third cycle, the Data A input is supplied with data words X8 ... X15, the Data B
input is supplied with X10 ... X17, and the coefficient is wrapped back to C0.
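
The two-cycle symmetric schedule can be checked against a direct 3-tap filter with a small model. The sketch below assumes floating point arithmetic for clarity (the hardware operates on 16 bit words), and the function names are hypothetical.

```python
def symmetric_row_filter_block(x, f, start, lanes=8):
    """3-tap symmetric FIR using the pre-adder, two cycles per block."""
    f0, f1, f2 = f
    if f0 != f2:
        raise ValueError("filter must be symmetric (F0 == F2)")
    c0, c1 = f0, 0.5 * f1                      # supplied coefficients
    acc = [0.0] * lanes
    # Cycle 1: Data A = X[start..], Data B = X[start+2..], coefficient C0.
    a = x[start:start + lanes]
    b = x[start + 2:start + 2 + lanes]
    for i in range(lanes):
        acc[i] += c0 * (a[i] + b[i])
    # Cycle 2: Data A and Data B both carry X[start+1..], coefficient C1.
    m = x[start + 1:start + 1 + lanes]
    for i in range(lanes):
        acc[i] += c1 * (m[i] + m[i])
    return acc


def direct_3tap(x, f, start, lanes=8):
    """Reference: plain 3-tap FIR, three multiplications per output."""
    return [f[0] * x[start + i] + f[1] * x[start + i + 1] + f[2] * x[start + i + 2]
            for i in range(lanes)]
```

Halving the centre coefficient compensates for the pre-adder doubling the centre sample, so both functions return the same values while the symmetric version needs two multiplications per output instead of three.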

Figure 18 illustrates where in memory the data comes from to perform a column filter
operation. The computational model and command syntax are similar to the row filter
computational model and command syntax, except that data is stored in row-major order, and
inner products are performed along columns. For best efficiency, data, coefficient and output
arrays should all be aligned to an 8 x 16 bit memory word. As is shown in Figure 18, in this
case the already aligned data is taken directly from memory word to the datapath. In other
words, no input formatting of the data is necessary. Each coefficient is applied to all 8 MAC
units in the parallel MACs configuration shown in Figures 8 and 10 through 11. An N tap
column filter takes N+1 cycles to produce 8 outputs. There are N memory reads and 1 data
memory write in each N+1 cycles. When N > 8, there is one coefficient memory read
every 8 cycles. Otherwise there is an initial read, then all subsequent coefficients are supplied
by the register in input formatter; no further read is needed. Coefficient read frequency is the
same as in row filtering, 1 read/8 cycles if N > 8, and is zero otherwise.
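
The N+1 cycle count for the column filter follows directly from one memory word being read per tap plus one write-back. A minimal sketch, with a hypothetical function name and a simple cycle counter added purely for illustration:

```python
def column_filter_word(rows, coeffs, lanes=8):
    """Filter one 8-wide column strip; returns (outputs, cycles).

    rows holds N aligned memory words (one per filter tap), each a list of
    8 values, so no input formatting is modeled.
    """
    if len(rows) != len(coeffs):
        raise ValueError("need one memory word per tap")
    acc = [0] * lanes
    cycles = 0
    for row, c in zip(rows, coeffs):     # one data memory read per tap
        for lane in range(lanes):
            acc[lane] += c * row[lane]   # coefficient broadcast to all MACs
        cycles += 1
    cycles += 1                          # one cycle for the data memory write
    return acc, cycles
```

For a 3-tap filter this gives 4 cycles per 8 outputs, consistent with the N+1 figure quoted above.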

Figure 19 illustrates the IPP configuration necessary to perform the sum of absolute
differences used to enhance the performance of video encoding. As shown in Figure 19, Data
A comprises X0 ... X7 and Data B comprises Y0 ... Y7. Coefficient words are not required. The
difference between each Data A input and each Data B input is calculated in subtractors 310,
320, 330, 340, 350, 360, 370 and 380 and those differences are stored in registers 312, 322,
332, 342, 352, 362, 372 and 382. That difference is then multiplied by either a plus or a
minus sign, depending upon whether the difference is positive or negative, in multipliers 314,
324, 334, 344, 354, 364, 374 and 384, in order to yield a positive number. Those products
are stored in registers 316, 326, 336, 346, 356, 366, 376 and 386, then summed in adders 318,
338, 358 and 378, and those sums summed in adders 328, 348 and 368. The sum of adder 348
is then accumulated in accumulator 349. For the sum of absolute differences we operate on 8-
bit pixels, so the adders only have to be 12 bits wide, except for the final accumulator, which
must be 16 bits wide. Saturation thresholds and rounding parameters can come from yet
another bank of registers.
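
The subtract, sign-multiply and tree-sum sequence for the sum of absolute differences can be sketched as follows. This is a behavioral model only; the function name is hypothetical, and bit widths and saturation are not modeled.

```python
def sad8(a, b, acc=0):
    """Add the sum of absolute differences of two 8-pixel words to acc."""
    diffs = [x - y for x, y in zip(a, b)]                   # subtract stage
    prods = [d * (1 if d >= 0 else -1) for d in diffs]      # multiply by +1 or -1
    pairs = [prods[i] + prods[i + 1] for i in range(0, 8, 2)]
    halves = [pairs[0] + pairs[1], pairs[2] + pairs[3]]
    return acc + halves[0] + halves[1]                      # final accumulate
```

Feeding the returned value back in as `acc` for successive rows of a block accumulates the block-level SAD that motion estimation compares across candidate positions.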

Figures 20, 21 and 22 illustrate the IPP operation of Discrete Sine/Cosine Demosaicing,
including the steps of Row Pass and Column Pass. Most digital still cameras employ a color
filter array in the imager that produces interleaved color information. Demosaicing is the
process of obtaining the missing color component from available neighboring same-color
components. A simple linear interpolation approach is often used, which can be represented by
the diagram illustrated in Figure 20. The weights are either 0.5 or 0.25, depending upon
whether there are 2 or 4 closest same-color neighbors (excluding boundary conditions).

The three colors are processed separately, with red processing essentially the same as
blue. Each color is processed in two passes, a row pass and a column pass. The row pass
is graphically represented in Figure 21. From each green/red line, one full green line and one
full red line is generated. For the green component, row pass filtering is implemented by a 2-
phase, 3-tap filter, with coefficients (0.5, 0, 0.5) and (0, 1, 0) for the two phases. For the red
component, row pass filtering is implemented by the same 2-phase, 3-tap filter, with
coefficients (0, 1, 0) and (0.5, 0, 0.5). Each blue/green line is processed similarly to generate
a full blue line and a full green line.
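
The 2-phase, 3-tap row pass can be illustrated with a short model. The sketch assumes that the line starts with a red sample (so even positions are red and odd positions are green), which is one layout consistent with the phase order quoted above, and it skips the two boundary pixels, since boundary handling is not specified here. All names are hypothetical.

```python
GREEN_PHASES = [(0.5, 0.0, 0.5), (0.0, 1.0, 0.0)]   # phase 0, phase 1
RED_PHASES = [(0.0, 1.0, 0.0), (0.5, 0.0, 0.5)]


def demosaic_row_pass(line, phases):
    """Interpolate one color along an interleaved line (interior pixels only)."""
    out = []
    for i in range(1, len(line) - 1):
        c = phases[i % 2]                   # coefficient set alternates per pixel
        out.append(c[0] * line[i - 1] + c[1] * line[i] + c[2] * line[i + 1])
    return out
```

Running the same line through both coefficient tables yields the full green line and the full red line, which corresponds to producing two output rows from one input row.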

Producing two color output rows from one row should be merged into one command,
using up-sampling-like looping. It takes 6 cycles to process 8 input pixels. For each group of
6 cycles, there is one data memory read, two data memory writes and three coefficient
memory reads.

The implementation of column pass for demosaic red/blue components is illustrated in
Figure 22a. For red and blue colors, two tap column filtering is used. It takes three cycles to
process 8 input pixels, during which there are two data memory reads, 1 data memory write,
and no steady-state coefficient memory reads.

The implementation of column pass for demosaic green components is illustrated in
Figure 22b. For the green color component, 2-phase 3-tap column filtering is used, with
coefficients (0.25, 0.5, 0.25) and (0, 1, 0). Eight input pixels are processed in 4 cycles.
There are three data memory reads, one data memory write, and zero coefficient memory
reads per group of 4 cycles.

In sum, 13 cycles are spent for the interpolation scheme of demosaic for 8 input pixels.
Out of 13 cycles, 6 data memory reads, 4 data memory writes and 3 coefficient memory reads
are performed.
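
The totals can be cross-checked by adding up the per-pass figures given for the row pass, the red/blue column pass and the green column pass. The snippet below is only a bookkeeping aid; the dictionary keys are labels chosen for this example.

```python
# Per-pass figures for 8 input pixels, as quoted in the preceding paragraphs.
PASSES = {
    "row pass": {"cycles": 6, "data_reads": 1, "data_writes": 2, "coeff_reads": 3},
    "red/blue column pass": {"cycles": 3, "data_reads": 2, "data_writes": 1, "coeff_reads": 0},
    "green column pass": {"cycles": 4, "data_reads": 3, "data_writes": 1, "coeff_reads": 0},
}


def demosaic_totals(passes=PASSES):
    """Sum each resource count over all demosaic passes."""
    keys = ["cycles", "data_reads", "data_writes", "coeff_reads"]
    return {k: sum(p[k] for p in passes.values()) for k in keys}
```

The sums come to 13 cycles, 6 data memory reads, 4 data memory writes and 3 coefficient memory reads.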

Figure 23 illustrates the formatting of the input data to perform the IPP operation of
wavelets, row pass. In image technology, wavelets are used for image
compression/decompression and feature extraction, for example, as a pre-processing stage for
textural features. The wavelets operation can be implemented on any of the parallel 8 MAC
configurations illustrated in Figures 8 and 10-13 or the more simplified versions of Figures 14
and 15. The row pass of wavelets analysis is implemented as 2x upsampling, 2x
downsampling (to achieve high/low frequency banks), row filtering.

Figure 24 illustrates where, in memory, the input data comes from in order to perform
the column pass portion of the wavelet operation. The column pass is treated as 2x
upsampling, 2x downsampling, column filtering. Again, data, coefficient and output arrays
should all be aligned to an 8 x 16 bit memory word. As is shown in Figure 18, data is taken
directly from memory word to the datapath. In other words, no input formatting of the data is
necessary. Each coefficient is applied to all 8 MAC units in the parallel MACs configuration
shown in Figures 8 and 10 through 13 or to the four MAC units illustrated in Figures 14 and
15. It takes N + 1 cycles to produce 8 outputs, where N is the number of filter taps in the
wavelets kernel. There are N memory reads and 1 data memory write in each N+1 cycles.
Coefficient read frequency is the same as in row filtering, 1 read/8 cycles if N > 8, and is zero
otherwise. For wavelet reconstruction, separately process high and low frequency banks with
2x upsampling filters. Finally, combine the two banks using vector addition.

Figure 25 illustrates the IPP operation of Inverse Discrete Cosine Transform (IDCT) in a row
pass format. As shown, row-pass IDCT is implemented with the full matrix-vector approach.
Thirty-two multiplications are used for each 8-point transform. Although seemingly not very
efficient, this is a straightforward application of the IPP. Any one of the 8 MAC configurations
shown in Figures 8 or 10-15 can be used to perform this operation, but the configuration of the
split adder trees with butterfly shown in Figure 11 is preferred. This configuration can take
advantage of symmetry in the transform to reduce the number of multiplications by half. In
this case the IPP uses the post-multiply adders to implement the cross additions/subtractions.
One input data word is pulled from the wide memory word per cycle, and 8 coefficient words
are used per cycle. Each 8-point transform takes 4 cycles to process. During these 4 cycles,
one data memory read, one data memory write and 4 coefficient memory reads are performed.
If the butterfly stage of reconfiguration is omitted (for example in Figures 14 and 15), the full
8-by-8 matrix multiplication method has to be used, resulting in 64 multiplications per 8-point
transform, and taking 8 or 16 cycles to perform each transform (with 8 or 4 MACs in IPP).
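
The halving of the multiplication count can be checked numerically. The sketch below assumes the orthonormal 8-point IDCT definition and uses hypothetical function names; it illustrates the symmetry argument only and does not claim to reproduce the wiring of Figure 11. The even-indexed and odd-indexed inputs each pass through a 4x4 product, and a cross add/subtract (the role given above to the post-multiply adders) forms the mirrored output pairs.

```python
import math

def idct_basis(n, k):
    """Scaled cosine basis of the orthonormal 8-point IDCT (assumed definition)."""
    scale = math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)
    return scale * math.cos((2 * n + 1) * k * math.pi / 16)

def idct8_full(X):
    """Full 8-by-8 matrix-vector product: 64 multiplications."""
    return [sum(idct_basis(n, k) * X[k] for k in range(8)) for n in range(8)]

def idct8_butterfly(X):
    """Two 4x4 products (32 multiplications) followed by cross add/subtract."""
    x = [0.0] * 8
    mults = 0
    for n in range(4):
        even = sum(idct_basis(n, k) * X[k] for k in (0, 2, 4, 6))
        odd = sum(idct_basis(n, k) * X[k] for k in (1, 3, 5, 7))
        mults += 8
        x[n] = even + odd         # cross addition
        x[7 - n] = even - odd     # cross subtraction gives the mirrored output
    return x, mults
```

At 8 multiplications per cycle, the 32 multiplications counted here correspond to the 4 cycles per transform quoted above.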
Figure 26 illustrates the IPP operation of Discrete Cosine Transform (DCT) in a row pass
format. Similar to the row-pass IDCT, row-pass DCT can be implemented with 32
multiplications or with 64 multiplications, depending on the configurability of IPP. When the
dual 4-tree with pre-multiply adders configuration (Figure 11) is available, it should be used.
The butterfly stage is disabled in this case. All 8 data words from each memory word are
applied to the MACs, one to each. Coefficients are applied the same way, one different
coefficient to each MAC. It takes 4 cycles to process one 8-point transform in this
configuration. Without the pre-multiply adders (for example in Figures 14 and 15), each 8-point
transform will require 64 multiplications, and take 8 or 16 cycles depending on the
number of MACs in the IPP.
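
For the forward transform the same symmetry is applied before the multipliers rather than after them. The following sketch assumes the orthonormal 8-point DCT definition and hypothetical function names; it is one common way to reach 32 multiplications and may differ in detail from the dataflow of the figures. Sums of mirrored inputs feed the even-indexed outputs and differences feed the odd-indexed outputs.

```python
import math

def dct_coeff(k, n):
    """Scaled cosine coefficient of the orthonormal 8-point DCT (assumed definition)."""
    scale = math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)
    return scale * math.cos((2 * n + 1) * k * math.pi / 16)

def dct8_preadd(x):
    """8-point DCT using pre-multiply adds: 8 outputs x 4 multiplications = 32."""
    s = [x[n] + x[7 - n] for n in range(4)]   # pre-multiply adders: sums
    d = [x[n] - x[7 - n] for n in range(4)]   # pre-multiply adders: differences
    X, mults = [], 0
    for k in range(8):
        src = s if k % 2 == 0 else d          # even outputs use sums, odd use differences
        X.append(sum(dct_coeff(k, n) * src[n] for n in range(4)))
        mults += 4
    return X, mults
```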

Figure 27 illustrates the IPP operation of IDCT in column format, Single Instruction
Multiple Data (SIMD). The parallel configuration of 8 MACs shown in Figure 8, with some
modifications in the accumulators, is needed to take advantage of symmetry in the transform.
Each MAC unit requires 8 accumulators, and each accumulating adder needs to take both
inputs from the 8 accumulators. With such hardware capability, during the first 4 cycles, one
4x4 matrix will yield the first 4 points. During the next 4 cycles, another 4x4 matrix will
produce the next 4 points. During cycles 9 and 10, the accumulating adders cross add/subtract
and combine the outputs. Therefore, in 10 cycles, a pair of output results, 16 points, are
produced. During those 10 cycles, 8 data reads, 2 data writes and 8 coefficient reads are
performed. Without the hardware modification, it takes 64 multiplications per 8-point
transform, so 16 points of output will take 16 cycles on the 8-MAC version of IPP, and 32 cycles
on the 4-MAC version of IPP. In either case the separate MAC configuration is used.
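
The cycle counts quoted for the column IDCT follow from simple arithmetic, reproduced in the sketch below. The function name and the cost model are assumptions made for illustration: one multiplication per MAC per cycle, 32 or 64 multiplications per 8-point transform, and 2 extra cross add/subtract cycles per 16 output points when the modified accumulators are used.

```python
def column_idct_cycles(points, n_macs, use_symmetry):
    """Estimate cycles to produce `points` IDCT outputs (a multiple of 16)."""
    transforms = points // 8                        # number of 8-point transforms
    mults = transforms * (32 if use_symmetry else 64)
    mac_cycles = mults // n_macs                    # one multiply per MAC per cycle
    # Assumed from the text: 2 combine cycles per pair of transforms (16 points),
    # needed only when the symmetric 4x4 decomposition is used.
    combine = 2 * (points // 16) if use_symmetry else 0
    return mac_cycles + combine
```

For 16 points this gives 10 cycles with the modified accumulators on 8 MACs, and 16 or 32 cycles without them on 8 or 4 MACs, matching the figures in the paragraph above.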

In addition to the datapath configurability and input formatting options, an efficient
control and address generation scheme is devised for IPP. This scheme reduces the
implementation cost of hardware control, and provides an easy-to-use programming model for
IPP.

All computation shall occur inside a nested for loop. Timing for accumulator
initialization and write out shall be controlled by conditioning on the loop variables.
Initialization shall happen when certain loop variables match with their beginning values.
Write out shall happen when the same set of variables match with their ending values.
Circulating accumulators can be specified with the innermost loop count indexing the
accumulators. All address increments for input data, coefficients, and results can be specified
in terms of "when" and "how much", and the "when" is associated with the loop variables.
The following is pseudo-code of a skeleton of the control structure for IPP that illustrates these
concepts.

dptr = dptr_init;    /* initial value of pointers */
cptr = cptr_init;
optr = optr_init;

for (i1=0; i1<=lp1end; i1++) {
  for (i2=0; i2<=lp2end; i2++) {
    for (i3=0; i3<=lp3end; i3++) {
      for (i4=0; i4<=lp4end; i4++) {

        /* memory read and input formatting */
        x[0..7] = dptr[0..7];
        /* or dptr[0], dptr[0,1], dptr[0,1,2,3] distributed */
        y[0..7] = cptr[0..7];
        /* or cptr[0], cptr[0,1], etc */

        /* accumulator initialization */
        if (initialize_acc)
          acc[i4*accmode][0..7] = rnd_add[0..7];

        /* operation-accumulate */
        acc[i4*accmode][0..7] += x[0..7] op y[0..7];

        /* write back */
        if (writeback)
          optr[0..7] = saturate_round(acc[i4*accmode][0..7]);
          /* or just 1, 2, or 4 outputs */

        /* pointer updates */
        dptr += ...;
        cptr += ...;
        optr += ...;
      }
    }
  }
}

The initialize_acc condition is tested by matching a specified subset of loop count
variables with the beginning values (0). The parameter acc_loop_level indicates whether none,
i4, i4 and i3, or i4, i3 and i2 should be tested. This same subset of loop count variables is
tested against their ending values to supply the writeback condition.
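
The two conditions can be written as one small function. This is a sketch under assumptions: the integer encoding of acc_loop_level (0 for none, 1 for i4, 2 for i4 and i3, 3 for i4, i3 and i2) and the name acc_conditions are hypothetical, and testing no variables is taken to mean that both conditions hold on every iteration.

```python
def acc_conditions(idx, ends, acc_loop_level):
    """Return (initialize_acc, writeback) for one iteration of the loop nest.

    idx and ends are 4-tuples (i1, i2, i3, i4) of loop counts and their
    ending values; beginning values are always 0.
    """
    tested = range(4 - acc_loop_level, 4)           # innermost variables first
    initialize_acc = all(idx[j] == 0 for j in tested)
    writeback = all(idx[j] == ends[j] for j in tested)
    return initialize_acc, writeback
```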

The pointer updates also involve comparing loop count variables. For example, for 4
levels of loops we can supply up to 4 sets of address modifiers for the data pointer dptr. Each
set consists of a subset of loop count variables that must match with their ending values, and the
amount by which dptr should be incremented when the condition is true. The same capability
is given to coefficient pointer cptr and output pointer optr.
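
A short address generator illustrates how such modifier sets produce a strided walk through memory. The representation is an assumption: each modifier is a hypothetical pair (subset, amount), where subset lists positions in (i1, i2, i3, i4), and the amounts of all matching sets are added together. The text does not say how simultaneous matches are resolved, so that rule is a guess.

```python
from itertools import product

def address_sequence(init, ends, modifiers):
    """Yield the pointer value seen at each iteration of the 4-level loop nest.

    modifiers: up to 4 pairs (subset, amount). A set matches when every
    listed loop variable equals its ending value; an empty subset always
    matches. Assumption: amounts of all matching sets are summed.
    """
    ptr = init
    for idx in product(*(range(e + 1) for e in ends)):
        yield ptr
        for subset, amount in modifiers:
            if all(idx[j] == ends[j] for j in subset):
                ptr += amount
```

For instance, list(address_sequence(100, (0, 0, 1, 2), [((), 1), ((3,), 5)])) steps by 1 inside a row of three words and skips a further 5 words at each row end.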

In the above pseudo code, parameters are used which are either statically set with the
Write_parameters command or are encoded in an IPP computational command. These
parameters include the ending values of loop count variables (beginning value is always 0),
accmode (single/circulating accumulator), op (multiply/add/subtract/absdiff), acc_loop_level
and the address modifiers mentioned above.
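
To show how these parameters fit together, the skeleton can be modelled as a small executable function. Everything beyond the parameter names taken from the text is an assumption: memories are lists of 8-lane words addressed by word index, address modifiers are hypothetical (subset, amount) pairs whose matching amounts are summed, acc_loop_level is encoded as 0 to 3, uninitialized accumulators read as zero, and saturation and rounding on write back are omitted.

```python
from itertools import product

OPS = {
    "multiply": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "absdiff": lambda a, b: abs(a - b),
}

def step(ptr, idx, ends, mods):
    """Add the amount of every modifier whose loop variables are at their ends."""
    return ptr + sum(amt for subset, amt in mods
                     if all(idx[j] == ends[j] for j in subset))

def ipp_run(data, coef, ends, op, acc_loop_level, accmode=0,
            dmods=(), cmods=(), omods=(), rnd_add=0):
    """Run the nested-loop skeleton; returns {output word address: 8 lanes}."""
    fn = OPS[op]
    tested = range(4 - acc_loop_level, 4)
    acc, out = {}, {}
    dptr = cptr = optr = 0
    for idx in product(*(range(e + 1) for e in ends)):
        x, y = data[dptr], coef[cptr]               # memory read, no input formatting
        a = idx[3] * accmode                        # single or circulating accumulator
        if all(idx[j] == 0 for j in tested):        # initialize_acc
            acc[a] = [rnd_add] * 8
        prev = acc.get(a, [0] * 8)                  # assumed zero if never initialized
        acc[a] = [s + fn(p, q) for s, p, q in zip(prev, x, y)]
        if all(idx[j] == ends[j] for j in tested):  # writeback (no saturate/round here)
            out[optr] = list(acc[a])
        dptr = step(dptr, idx, ends, dmods)
        cptr = step(cptr, idx, ends, cmods)
        optr = step(optr, idx, ends, omods)
    return out
```

As a usage example, a lane-wise sum of absolute differences over four rows is obtained with ends=(0, 0, 0, 3), op="absdiff", acc_loop_level=1 and a unit step on both input pointers. Summing across the 8 lanes, which an adder-tree configuration would perform, is not modelled.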

All the supported imaging/video functions can be written in the above form and then
translated into IPP commands by properly setting the parameters. The task of software
development for IPP can follow this methodology.
I claim:

1. An image processing peripheral comprising:

a plurality of pairs of multiply accumulate circuits connected in parallel, each
pair of multiply accumulate circuits comprising;

first adder pairs, each one of each adder pair having first and second
inputs receiving respective first and second inputs having a first predetermined
number of bits and an output producing a sum or a difference of said inputs;

first multiplier pairs, corresponding to said first adder pairs, each
multiplier of each multiplier pair having a first input of said sum or difference
of said first adders and a second input of a constant predetermined number and
producing a product output;

second adder pairs, corresponding to said first multiplier pairs, each one
adder of said adder pair having first and second inputs receiving respective first
multiplier outputs from one or the other of said multipliers of said
corresponding multiplier pair as said first input and

wherein said one of said pair of second adders receives an output from a
multiplexer, said multiplexer having one input from a product of the other
multiplier of said first multiplier pairs and a second input from an accumulated
sum of said one adder of said second adder pairs as a second input of said one
adder of said second adder pair and;

wherein said other of said pair of second adders receives outputs from a
first and a second multiplexer, said first multiplexer having one input from said
other multiplier of said first multiplier pair and a second input from the sum of
said one adder of said second adder pair, said second multiplexer having one
input from the accumulated sum of said other adder of said second adder pair
and a second input from the sum of a one adder of a second pair of second adder
pairs, as a second output, and;

wherein said second adder pairs produce a sum output.
2. An image processing peripheral comprising:

a plurality of pairs of multiply accumulate circuits connected in parallel, each
pair of multiply accumulate circuits comprising;

first adder pairs, each one adder of each adder pair having first and
second inputs receiving respective first and second inputs having a first
predetermined number of bits and an output producing a sum or a difference of
said inputs;

first multiplier pairs, corresponding to said first adder pairs, each
multiplier of each multiplier pair having a first input of said sum or difference
of said first adders and a second input of a constant predetermined number and
producing a product output;

second adder pairs, corresponding to said first multiplier pairs, each
adder pair implementing separate accumulation of said products of said first
multiplier pairs, yielding an accumulated sum.

3. An image processing peripheral comprising:

a plurality of pairs of multiply accumulate circuits connected in parallel,
each pair of multiply accumulate circuits comprising;

first adder pairs, each one adder of each adder pair having first and
second inputs receiving respective first and second inputs having a first
predetermined number of bits and an output producing a sum or a difference of
said inputs;

first multiplier pairs, corresponding to said first adder pairs, each
multiplier of each multiplier pair having a first input of said sum or difference
of said first adders and a second input of a constant predetermined number and
producing a product output;

second adder pairs, corresponding to said first multiplier pairs, each
adder pair implementing summation of said products of said pairs of multipliers
and then accumulating the sums of said summations.
4. An image processing peripheral comprising:

a plurality of pairs of multiply accumulate circuits connected in parallel,
each pair of multiply accumulate circuits comprising;

first adder pairs, each one adder of each adder pair having first and
second inputs receiving respective first and second inputs having a first
predetermined number of bits and an output producing a sum or a difference of
said inputs;

first multiplier pairs, corresponding to said first adder pairs, each
multiplier of each multiplier pair having a first input of said sum or difference
of said first adders and a second input of a constant predetermined number and
producing a product output;

second adder pairs, corresponding to said first multiplier pairs, each
adder pair implementing accumulation of sums of two products.
FIG. 13a
FIG. 22a
ABSTRACT

The proposed architecture is integrated onto a Digital Signal Processor (DSP) as
a coprocessor (140) to assist in the computation of sum of absolute differences,
symmetrical row/column Finite Impulse Response (FIR) filtering with a downsampling
(or upsampling) option, row/column Discrete Cosine Transform (DCT)/Inverse Discrete
Cosine Transform (IDCT), and generic algebraic functions. The architecture is called
IPP, which stands for image processing peripheral, and consists of 8
multiply-accumulate hardware units connected in parallel and routed and multiplexed together.
The architecture can be dependent upon a Direct Memory Access (DMA) controller
(120) to retrieve and write back data from/to DSP memory without intervention from the
DSP core (110). The DSP can set up the DMA transfer and IPP/DMA synchronization
in advance, then go on its own processing task. Alternatively, the DSP can perform the
data transfers and synchronization itself by synchronizing with the IPP architecture on
these transfers. This architecture implements 2-D filtering, symmetrical filtering, short
filters, sum of absolute differences, and mosaic decoding more efficiently than the
previously disclosed architectures of the prior art.



